When can 64-bit writes be guaranteed to be atomic, when programming in C on an Intel x86-based platform (in particular, an Intel-based Mac running MacOSX 10.4 using the Intel compiler)? For example:
unsigned long long int y;
y = 0xfedcba87654321ULL;
/* ... a bunch of other time-consuming stuff happens... */
y = 0x12345678abcdefULL;
If another thread is examining the value of y after the first assignment to y has finished executing, I would like to ensure that it sees either the value 0xfedcba87654321 or the value 0x12345678abcdef, and not some blend of them. I would like to do this without any locking, and if possible without any extra code. My hope is that, when using a 64-bit compiler (the 64-bit Intel compiler), on an operating system capable of supporting 64-bit code (MacOSX 10.4), that these 64-bit writes will be atomic. Is this always true?
Your best bet is to avoid trying to build your own system out of primitives, and instead use locking unless it really shows up as a hot spot when profiling. (If you think you can be clever and avoid locks, don't. You aren't. That's the general "you" which includes me and everybody else.) You should at minimum use a spin lock, see spinlock(3). And whatever you do, don't try to implement "your own" locks. You will get it wrong.
Ultimately, you need to use whatever locking or atomic operations your operating system provides. Getting these sorts of things exactly right in all cases is extremely difficult. Often it can involve knowledge of things like the errata for specific versions of specific processor. ("Oh, version 2.0 of that processor didn't do the cache-coherency snooping at the right time, it's fixed in version 2.0.1 but on 2.0 you need to insert a NOP.") Just slapping a volatile keyword on a variable in C is almost always insufficient.
On Mac OS X, that means you need to use the functions listed in atomic(3) to perform truly atomic-across-all-CPUs operations on 32-bit, 64-bit, and pointer-sized quantities. (Use the latter for any atomic operations on pointers so you're 32/64-bit compatible automatically.) That goes whether you want to do things like atomic compare-and-swap, increment/decrement, spin locking, or stack/queue management. Fortunately the spinlock(3), atomic(3), and barrier(3) functions should all work correctly on all CPUs that are supported by Mac OS X.
On x86_64, both the Intel compiler and gcc support some intrinsic atomic-operation functions. Here's gcc's documentation of them: http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html
The Intel compiler docs also talk about them here: http://softwarecommunity.intel.com/isn/downloads/softwareproducts/pdfs/347603.pdf (page 164 or thereabouts).
According to Chapter 7 of Part 3A - System Programming Guide of Intel's processor manuals, quadword accesses will be carried out atomically if aligned on a 64-bit boundary, on a Pentium or newer, and unaligned (if still within a cache line) on a P6 or newer. You should use volatile to ensure that the compiler doesn't try to cache the write in a variable, and you may need to use a memory fence routine to ensure that the write happens in the proper order.
If you need to base the value written on an existing value, you should use your operating system's Interlocked features (e.g. Windows has InterlockedIncrement64).
On Intel MacOSX, you can use the built-in system atomic operations. There isn't a provided atomic get or set for either 32 or 64 bit integers, but you can build that out of the provided CompareAndSwap. You may wish to search XCode documentation for the various OSAtomic functions. I've written the 64-bit version below. The 32-bit version can be done with similarly named functions.
#include <libkern/OSAtomic.h>
// bool OSAtomicCompareAndSwap64Barrier(int64_t oldValue, int64_t newValue, int64_t *theValue);
void AtomicSet(uint64_t *target, uint64_t new_value)
{
while (true)
{
uint64_t old_value = *target;
if (OSAtomicCompareAndSwap64Barrier(old_value, new_value, target)) return;
}
}
uint64_t AtomicGet(uint64_t *target)
{
while (true)
{
int64 value = *target;
if (OSAtomicCompareAndSwap64Barrier(value, value, target)) return value;
}
}
Note that Apple's OSAtomicCompareAndSwap functions atomically perform the operation:
if (*theValue != oldValue) return false;
*theValue = newValue;
return true;
We use this in the example above to create a Set method by first grabbing the old value, then attempting to swap the target memory's value. If the swap succeeds, that indicates that the memory's value is still the old value at the time of the swap, and it is given the new value during the swap (which itself is atomic), so we are done. If it doesn't succeed, then some other thread has interfered by modifying the value in-between when we grabbed it and when we tried to reset it. If that happens, we can simply loop and try again with only minimal penalty.
The idea behind the Get method is that we can first grab the value (which may or may not be the actual value, if another thread is interfering). We can then try to swap the value with itself, simply to check that the initial grab was equal to the atomic value.
I haven't checked this against my compiler, so please excuse any typos.
You mentioned OSX specifically, but in case you need to work on other platforms, Windows has a number of Interlocked* functions, and you can search the MSDN documentation for them. Some of them work on Windows 2000 Pro and later, and some (particularly some of the 64-bit functions) are new with Vista. On other platforms, GCC versions 4.1 and later have a variety of __sync* functions, such as __sync_fetch_and_add(). For other systems, you may need to use assembly, and you can find some implementations in the SVN browser for the HaikuOS project, inside src/system/libroot/os/arch.
On X86, the fastest way to atomically write an aligned 64-bit value is to use FISTP. For unaligned values, you need to use a CAS2 (_InterlockedExchange64). The CAS2 operation is quite slow due to BUSLOCK though so it can often be faster to check alignment and do the FISTP version for aligned addresses. Indeed, this is how the Intel Threaded building Blocks implements Atomic 64-bit writes.
The latest version of ISO C (C11) defines a set of atomic operations, including atomic_store(_explicit). See e.g. this page for more information.
The second most portable implementation of atomics are the GCC intrinsics, which have already been mentioned. I find that they are fully supported by GCC, Clang, Intel, and IBM compilers, and - as of the last time I checked - partially supported by the Cray compilers.
One clear advantage of C11 atomics - in addition to the whole ISO standard thing - is that they support a more precise memory consistency prescription. The GCC atomics imply a full memory barrier as far as I know.
If you want to do something like this for interthread or interprocess communication, then you need to have more than just an atomic read/write guarantee. In your example, it appears that you want the values written to indicate that some work is in progress and/or has been completed. You will need to do several things, not all of which are portable, to ensure that the compiler has done things in the order you want them done (the volatile keyword may help to a certain extent) and that memory is consistent. Modern processors and caches can perform work out of order unbeknownst to the compiler, so you really need some platform support (ie., locks or platform-specific interlocked APIs) to do what it appears you want to do.
"Memory fence" or "memory barrier" are terms you may want to research.
GCC has intrinsics for atomic operations; I suspect you can do similar with other compilers, too. Never rely on the compiler for atomic operations; optimization will almost certainly run the risk of making even obviously atomic operations into non-atomic ones unless you explicitly tell the compiler not to do so.
Related
I'm working on a portable library for baremetal embedded applications.
Assume that I have a timer ISR that increments a counter and, in the main loop, this counter read is from in a most certainly not atomic load.
I'm trying to ensure load consistency (i.e. that I'm not reading garbage because the load was interrupted and the value changed) without resorting to disabling interrupts. It does not matter if the value changed after reading the counter as long as the read value is proper. Does this do the trick?
uint32_t read(volatile uint32_t *var){
uint32_t value;
do { value = *var; } while(value != *var);
return value;
}
It's highly unlikely that there's any sort of a portable solution for this, not least because plenty of C-only platforms are really C-only and use one-off compilers, i.e. nothing mainstream and modern-standards-compliant like gcc or clang. So if you're truly targeting entrenched C, then it's all quite platform-specific and not portable - to the point where "C99" support is a lost cause. The best you can expect for portable C code is ANSI C support - referring to the very first non-draft C standard published by ANSI. That is still, unfortunately, the common denominator - that major vendors get away with. I mean: Zilog somehow gets away with it, even if they are now but a division of Littelfuse, formerly a division of IXYS Semiconductor that Littelfuse had acquired.
For example, here are some compilers where there's only a platform-specific way of doing it:
Zilog eZ8 using a "recent" Zilog C compiler (anything 20 years old or less is OK): 8-bit value read-modify-write is atomic. 16-bit operations where the compiler generates word-aligned word instructions like LDWX, INCW, DECW are atomic as well. If the read-modify-write otherwise fits into 3 instructions or less, you'd prepend the operation with asm("\tATM");. Otherwise, you'd need to disable the interrupts: asm("\tPUSHF\n\tDI");, and subsequently re-enable them: asm("\tPOPF");.
Zilog ZNEO is a 16 bit platform with 32-bit registers, and read-modify-write accesses on registers are atomic, but memory read-modify-write round-trips via a register, usually, and takes 3 instructions - thus prepend the R-M-W operation with asm("\tATM").
Zilog Z80 and eZ80 require wrapping the code in asm("\tDI") and asm("\tEI"), although this is valid only when it's known that the interrupts are always enabled when your code runs. If they may not be enabled, then there's a problem since Z80 does not allow reading the state of IFF1 - the interrupt enable flip-flop. So you'd need to save a "shadow" of its state somewhere, and use that value to conditionally enable interrupts. Unfortunately, eZ80 does not provide an interrupt controller register that would allow access to IEF1 (eZ80 uses the IEFn nomenclature instead of IFFn) - so this architectural oversight is carried over from the venerable Z80 to the "modern" one.
Those aren't necessarily the most popular platforms out there, and many people don't bother with Zilog compilers due to their fairly poor quality (low enough that yours truly had to write an eZ8-targeting compiler*). Yet such odd corners are the mainstay of C-only code bases, and library code has no choice but to accommodate this, if not directly then at least by providing macros that can be redefined with platform-specific magic.
E.g. you could provide empty-by-default macros MYLIB_BEGIN_ATOMIC(vector) and MYLIB_END_ATOMIC(vector) that would be used to wrap code that requires access atomic with respect to a given interrupt vector (or e.g. -1 if with respect to all interrupt vectors). Naturally, replace MYLIB_ with a "namespace" prefix specific to your library.
To enable platform-specific optimizations such as ATM vs DI on "modern" Zilog platforms, an additional argument could be provided to the macro to separate the presumed "short" sequences that the compiler is apt to generate three-instruction sequences for vs. longer ones. Such micro-optimization requires usually an assembly output audit (easily automatable) to verify the assumption of the instruction sequence length, but at least the data to drive the decision would be available, and the user would have a choice of using it or ignoring it.
*If some lost soul wants to know anything bordering on the arcane re. eZ8 - ask away. I know entirely too much about that platform, in details so gory that even modern Hollywood CG and SFX would have a hard time reproducing the true depth of the experience on-screen. I'm also possibly the only one out there running the 20MHz eZ8 parts occasionally at 48MHz clock - as sure a sign of demonic possession as the multiverse allows. If you think it's outrageous that such depravity makes it into production hardware - I'm with you. Alas, business case is business case, laws of physics be damned.
Are you running on any systems that have uint32_t larger than a single assembly instruction word read/write size? If not, the IO to memory should be a single instructions and therefore atomic (assuming the bus is also word sized...) You get in trouble when the compiler breaks it up into multiple smaller read/writes. Otherwise, I've always had to resort to DI/EI. You could have the user configure your library such that it has information if atomic instructions or minimum 32-bit word size are available to prevent interrupt twiddling. If you have these guarantees, you don't need to verification code.
To answer the question though, on a system that must split the read/writes, your code is not safe. Imagine a case where you read your value in correctly in the "do" part, but the value gets split during the "while" part check. Further, in an extreme case, this is an infinite loop. For complete safety, you'd need a retry count and error condition to prevent that. The loop case is extreme for sure, but I'd want it just in case. That of course makes the run time longer.
Let's show a failure case for examples - will use 16-bit numbers on a machine that reads 8-bit values at a time to make it easier to follow:
Value to read from memory *var is 0x1234
Read 8-bit 0x12
*var becomes 0x5678
Read 8-bit 0x78 - value is now 0x1278 (invalid)
*var becomes 0x1234
Verification step reads 8-bit 0x12
*var becomes 0x5678
Verification reads 8-bit 0x78
Value confirmed correct 0x1278, but this is an error as *var was only 0x1234 and 0x5678.
Another failure case would be when *var just happens to change at the same frequency as your code is running, which could lead to an infinite loop as each verification fails. Or even if it did break out eventually, this would be a very hard to track performance bug.
Let me explain what I mean by data consistency issue. Take following scenario for example
uint16 x,y;
x=0x01FF;
y=x;
Clearly, these variables are 16 bit but if an 8 bit CPU is used with this code, read or write operations would not be atomic. Thereby an interrupt can occur in between and change the value.This is one situation which MIGHT lead to data inconsistency.
Here's another example,
if(x>7) //x is global variable
{
switch(x)
{
case 8://do something
break;
case 10://do something
break;
default: //do default
}
}
In the above excerpt code, if an interrupt is changing the value of x from 8 to 5 after the if statement but before the switch statement,we end up in default case, instead of case 8.
Please note, I'm looking for ways to detect such scenarios (but not solutions)
Are there any tools that can detect such issues in Embedded C?
It is possible for a static analysis tool that is context (thread/interrupt) aware to determine the use of shared data, and that such a tool could recognise specific mechanisms to protect such data (or lack thereof).
One such tool is Polyspace Code Prover; it is very expensive and very complex, and does a lot more besides that described above. Specifically to quote (elided) from the whitepaper here:
With abstract interpretation the following program elements are interpreted in new ways:
[...]
Any global shared data may change at any time in a multitask program, except when protection
mechanisms, such as memory locks or critical sections, have been applied
[...]
It may have improved in the long time since I used it, but one issue I had was that it worked on a lock-access-unlock idiom, where you specified to the tool what the lock/unlock calls or macros were. The problem with that is that the C++ project I worked on used a smarter method where a locking object (mutex, scheduler-lock or interrupt disable for example) locked on instantiation (in the constructor) and unlocked in the destructor so that it unlocked automatically when the object went out of scope (a lock by scope idiom). This meant that the unlock was implicit and invisible to Polyspace. It could however at least identify all the shared data.
Another issue with the tool is that you must specify all thread and interrupt entry points for concurrency analysis, and in my case these were private-virtual functions in task and interrupt classes, again making them invisible to Polyspace. This was solved by conditionally making the entry-points public for the abstract analysis only, but meant that the code being tested does not have the exact semantics of the code to be run.
Of course these are non-problems for C code, and in my experience Polyspace is much more successfully applied to C in any case; you are far less likely to be writing code in a style to suit the tool rather than the tool working with your existing code-base.
There are no such tools as far as I am aware. And that is probably because you can't detect them.
Pretty much every operation in your C code has the potential to get interrupted before it is finished. Less obvious than the 16 bit scenario, there is also this:
uint8_t a, b;
...
a = b;
There is no guarantees that this is atomic! The above assignment may as well translate to multiple assembler instructions, such as 1) load a into register, 2) store register at memory address. You can't know this unless you disassemble the C code and check.
This can create very subtle bugs. Any assumption of the kind "as long as I use 8 bit variables on my 8 bit CPU, I can't get interrupted" is naive. Even if such code would result in atomic operations on a given CPU, the code is non-portable.
The only reliable, fully-portable solution is to use some manner of semaphore. On embedded systems, this could be as simple as a bool variable. Another solution is to use inline assembler, but that can't be ported cross platform.
To solve this, C11 introduced the qualifier _Atomic to the language. However, C11 support among embedded systems compilers is still mediocre.
I have a multithreaded application where I one producer thread(main) and multiple consumers.
Now from main I want to have some sort of percentage of how far into the work the consumers are. Implementing a counter is easy as the work that is done a loop. However since this loop repeats a couple of thousands of times, maybe even more than a million times. I don`t want to mutex this part. So I went looking into some atomic options of writing to an int.
As far as I understand I can use the builtin atomic functions from gcc:
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
however, it doesn`t have a function for just reading the variable I want to work on.
So basically my question is.
can I read from the variable safely from my producer, as long as I use the atomic builtins for writing to that same variable in the consumer
or
do I need some sort of different function to read from the variable. and what function is that
Define "safely".
If you just use a regular read, on x86, for naturally aligned 32-bit or smaller data, the read is atomic, so you will always read a valid value rather than one containing some bytes written by one thread and some by another. If any of those things are not true (not x86, not naturally aligned, larger than 32 bits...) all bets are off.
That said, you have no guarantee whatsoever that the value read will be particularly fresh, or that the sequence of values seen over multiple reads will be in any particular order. I have seen naive code using volatile to defeat the compiler optimising away the read entirely but no other synchronisation mechanism, literally never see an updated value due to CPU caching.
If any of these things matter to you, and they really should, you should explicitly make the read atomic and use the appropriate memory barriers. The intrinsics you refer to take care of both of these things for you: you could call one of the atomic intrinsics in such a way that there is no side effect other than returning the value:
__sync_val_compare_and_swap(ptr, 0, 0)
or
__sync_add_and_fetch(ptr, 0)
or
__sync_sub_and_fetch(ptr, 0)
or whatever
If your compiler supports it, you can use C11 atomic types. They are introduced in the section 7.17 of the standard, but they are unfortunately optional, so you will have to check whether __STDC_NO_ATOMICS__ is defined to at least throw a meaningful error if it's not supported.
With gcc, you apparently need at least version 4.9, because otherwise the header is missing (here is a SO question about this, but I can't verify because I don't have GCC-4.9).
I'll answer your question, but you should know upfront that atomics aren't cheap. The CPU has to synchronize between cores every time you use atomics, and you won't like the performance results if you use atomics in a tight loop.
The page you linked to lists atomic operations for the writer, but says nothing about how such variables should be read. The answer is that your other CPU cores will "see" the updated values, but your compiler may "cache" the old value in a register or on the stack. To prevent this behavior, I suggest you declare the variable volatile to force your compiler not to cache the old value.
The only safety issue you will encounter is stale data, as described above.
If you try to do anything more complex with atomics, you may run into subtle and random issues with the order atomics are written to by one thread versus the order you see those changes in another thread. Unfortunately you're not using a built-in language feature, and the compiler builtins aren't designed perfectly. If you choose to use these builtins, I suggest you keep your logic very simple.
If I understood the problem, I would not use any atomic variable for the counters. Each worker thread can have a separate counter that it updates locally, the master thread can read the whole array of counters for an approximate snapshot value, so this becomes a 1 consumer 1 producer problem. The memory can be made visible to the master thread, for example, every 5 seconds, by using __sync_synchronize() or similar.
As per my knowledge msse and msse2 option of gcc will improve the performance by performing arithmetic operation faster. And also I read some where like it will use more resources like registers, cache memory.
What about the performance if we use the executable generated with these options on RTOS devices(like vxworks board) ?
The OS must support SSE(2) instructions for your application to work correctly. It would seem, from googling, that VcWorks supports this (and it's not really that hard, all it takes is that the OS has a 512 byte save-area per task that uses SSE/SSE2 - given the right circumstances, it can be allocated on demand, but it's often easier to just allocate it to all tasks]. Saving/restoring SSE registers is done "on demand", that is, only when a task different from the previous one to use SSE is using SSE instructions, is it necessary to save the registers. The OS will use a special interrupt(trap) to indicate that "a new task is trying to use SSE instructions.
So, as long as the processor supports it, you should be fine.
I may not be able to directly answer your question, but here are a couple things I do know that may be of use:
SSE, SSE2, etc. must be supported/implemented by the processor for them to have any affect in the first place.
There are specific functions you can call that use these extended instructions for mathematical operations. These functions operate on wider data types or perform an operation on a set efficiently.
Enabling the options in GCC may use the previous APIs/builtins automatically. This is the part I am unsure about.
I'm working on an embedded project (PowerPC target, Freescale Metrowerks Codewarrior compiler) where the registers are memory-mapped and defined in nice bitfields to make twiddling the individual bit flags easy.
At the moment, we are using this feature to clear interrupt flags and control data transfer. Although I haven't noticed any bugs yet, I was curious if this is safe. Is there some way to safely use bit fields, or do I need to wrap each in DISABLE_INTERRUPTS ... ENABLE_INTERRUPTS?
To clarify: the header supplied with the micro has fields like
union {
vuint16_t R;
struct {
vuint16_t MTM:1; /* message buffer transmission mode */
vuint16_t CHNLA:1; /* channel assignement */
vuint16_t CHNLB:1; /* channel assignement */
vuint16_t CCFE:1; /* cycle counter filter enable */
vuint16_t CCFMSK:6; /* cycle counter filter mask */
vuint16_t CCFVAL:6; /* cycle counter filter value */
} B;
} MBCCFR;
I assume setting a bit in a bitfield is not atomic. Is this a correct assumption? What kind of code does the compiler actually generate for bitfields? Performing the mask myself using the R (raw) field might make it easier to remember that the operation is not atomic (it is easy to forget that an assignment like CAN_A.IMASK1.B.BUF00M = 1 isn't atomic).
Your advice is appreciated.
Atomicity depends on the target and the compiler. AVR-GCC for example trys to detect bit access and emit bit set or clear instructions if possible. Check the assembler output to be sure ...
EDIT: Here is a resource for atomic instructions on PowerPC directly from the horse's mouth:
http://www.ibm.com/developerworks/library/pa-atom/
It is correct to assume that setting bitfields is not atomic. The C standard isn't particularly clear on how bitfields should be implemented and various compilers go various ways on them.
If you really only care about your target architecture and compiler, disassemble some object code.
Generally, your code will achieve the desired result but be much less efficient than code using macros and shifts. That said, it's probably more readable to use your bit fields if you don't care about performance here.
You could always write a setter wrapper function for the bits that is atomic, if you're concerned about future coders (including yourself) being confused.
Yes, your assumption is correct, in the sense that you may not assume atomicity. On a specific platform you might get it as an extra, but you can't rely on it in any case.
Basically the compiler performs masking and things for you. He might be able to take advantage of corner cases or special instructions. If you are interested in efficiency look into the assembler that your compiler produces with that, usually it is quite instructive. As a rule of thumb I'd say that modern compilers produces code that is as efficient as medium programming effort would be. Real deep bit twiddeling for your specific compiler could perhaps gain you some cycles.
I think that using bitfields to model hardware registers is not a good idea.
So much about how bitfields are handled by a compiler is implementation-defined (including how fields that span byte or word boundaries are handled, endianess issues, and exactly how getting, setting and clearing bits is implemented). See C/C++: Force Bit Field Order and Alignment
To verify that register accesses are being handled how you might expect or need them to be handled, you would have to carefully study the compiler docs and/or look at the emitted code. I suppose that if the headers supplied with the microprocessor toolset uses them you can be assume that most of my concerns are taken care of. However, I'd guess that atomic access isn't necessarily...
I think it's best to handle these type of bit-level accesses of hardware registers using functions (or macros, if you must) that perform explicit read/modify/write operations with the bit mask that you need, if that's what your processor requires.
Those functions could be modified for architectures that support atomic bit-level accesses (such as the ARM Cortex M3's "bit-banding" addressing). I don't know if the PowerPC supports anything like this - the M3 is the only processor I've dealt with that supports it in a general fashion. And even the M3's bit-banding supports 1-bit accesses; if you're dealing with a field that's 6-bits wide, you have to go back to the read/modify/write scenario.
It totally depends on the architecture and compiler whether the bitfield operations are atomic or not. My personal experience tells: don't use bitfields if you don't have to.
I'm pretty sure that on powerpc this is not atomic, but if your target is a single core system then you can just:
void update_reg_from_isr(unsigned * reg_addr, unsigned set, unsigned clear, unsigned toggle) {
unsigned reg = *reg_addr;
reg |= set;
reg &= ~clear;
reg ^= toggle;
*reg_addr = reg;
}
void update_reg(unsigned * reg_addr, unsigned set, unsigned clear, unsigned toggle) {
interrupts_block();
update_reg_from_isr(reg_addr, set, clear, toggle);
interrupts_enable();
}
I don't remember if powerpc's interrupt handlers are interruptible, but if they are then you should just use the second version always.
If your target is a multiprocessor system then you should make locks (spinlocks, which disable interrupts on the local processor and then wait for any other processors to finish with the lock) that protect access to things like hardware registers, and acquire the needed locks before you access the register, and then release the locks immediately after you have finished updating the register (or registers).
I read once how to implement locks in powerpc -- it involved telling the processor to watch the memory bus for a certain address while you did some operations and then checking back at the end of those operations to see if the watch address had been written to by another core. If it hadn't then your operation was sucessful; if it had then you had to redo the operation. This was in a document written for compiler, library, and OS developers. I don't remember where I found it (probably somewhere on IBM.com) but a little hunting should turn it up. It probably also has info on how to do atomic bit twiddling.