What does the C compiler do with bitfields? - c

I'm working on an embedded project (PowerPC target, Freescale Metrowerks Codewarrior compiler) where the registers are memory-mapped and defined in nice bitfields to make twiddling the individual bit flags easy.
At the moment, we are using this feature to clear interrupt flags and control data transfer. Although I haven't noticed any bugs yet, I was curious if this is safe. Is there some way to safely use bit fields, or do I need to wrap each in DISABLE_INTERRUPTS ... ENABLE_INTERRUPTS?
To clarify: the header supplied with the micro has fields like
union {
    vuint16_t R;
    struct {
        vuint16_t MTM:1;    /* message buffer transmission mode */
        vuint16_t CHNLA:1;  /* channel assignment */
        vuint16_t CHNLB:1;  /* channel assignment */
        vuint16_t CCFE:1;   /* cycle counter filter enable */
        vuint16_t CCFMSK:6; /* cycle counter filter mask */
        vuint16_t CCFVAL:6; /* cycle counter filter value */
    } B;
} MBCCFR;
I assume setting a bit in a bitfield is not atomic. Is this a correct assumption? What kind of code does the compiler actually generate for bitfields? Performing the mask myself using the R (raw) field might make it easier to remember that the operation is not atomic (it is easy to forget that an assignment like CAN_A.IMASK1.B.BUF00M = 1 isn't atomic).
Your advice is appreciated.

Atomicity depends on the target and the compiler. AVR-GCC, for example, tries to detect bit access and emit bit-set or bit-clear instructions where possible. Check the assembler output to be sure ...
EDIT: Here is a resource for atomic instructions on PowerPC directly from the horse's mouth:
http://www.ibm.com/developerworks/library/pa-atom/

It is correct to assume that setting bitfields is not atomic. The C standard isn't particularly clear on how bitfields should be implemented and various compilers go various ways on them.
If you really only care about your target architecture and compiler, disassemble some object code.
Generally, your code will achieve the desired result but be much less efficient than code using macros and shifts. That said, it's probably more readable to use your bit fields if you don't care about performance here.
You could always write a setter wrapper function for the bits that is atomic, if you're concerned about future coders (including yourself) being confused.
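As a minimal sketch of such a wrapper (the `interrupts_disable`/`interrupts_restore` hooks here are placeholders; substitute whatever your toolchain provides, such as manipulating MSR[EE] on PowerPC):

```c
#include <stdint.h>

/* Placeholder critical-section hooks -- replace with your toolchain's
 * intrinsics or inline assembly. Modeled here with a nesting counter. */
static unsigned irq_depth;
static void interrupts_disable(void) { irq_depth++; }
static void interrupts_restore(void) { irq_depth--; }

/* Set `mask` bits in *reg, atomic with respect to interrupts. */
static void reg_set_bits(volatile uint16_t *reg, uint16_t mask)
{
    interrupts_disable();
    *reg |= mask;              /* read-modify-write happens with IRQs off */
    interrupts_restore();
}

/* Clear `mask` bits in *reg, atomic with respect to interrupts. */
static void reg_clear_bits(volatile uint16_t *reg, uint16_t mask)
{
    interrupts_disable();
    *reg &= (uint16_t)~mask;
    interrupts_restore();
}
```

The explicit function names also make it obvious at the call site that a read-modify-write is taking place.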

Yes, your assumption is correct, in the sense that you may not assume atomicity. On a specific platform you might get it as an extra, but you can't rely on it in any case.
Basically, the compiler performs the masking and shifting for you. It might be able to take advantage of corner cases or special instructions. If you are interested in efficiency, look at the assembler your compiler produces for this; it is usually quite instructive. As a rule of thumb, I'd say modern compilers produce code about as efficient as a medium programming effort would. Really deep bit twiddling for your specific compiler could perhaps gain you some cycles.

I think that using bitfields to model hardware registers is not a good idea.
So much about how bitfields are handled by a compiler is implementation-defined (including how fields that span byte or word boundaries are handled, endianness issues, and exactly how getting, setting and clearing bits is implemented). See C/C++: Force Bit Field Order and Alignment
To verify that register accesses are handled the way you expect or need them to be, you would have to carefully study the compiler docs and/or look at the emitted code. I suppose that if the headers supplied with the microprocessor toolset use bitfields, you can assume most of my concerns are taken care of. However, I'd guess that atomic access isn't necessarily among them...
I think it's best to handle these type of bit-level accesses of hardware registers using functions (or macros, if you must) that perform explicit read/modify/write operations with the bit mask that you need, if that's what your processor requires.
Those functions could be modified for architectures that support atomic bit-level accesses (such as the ARM Cortex-M3's "bit-banding" addressing). I don't know if the PowerPC supports anything like this - the M3 is the only processor I've dealt with that supports it in a general fashion. And even the M3's bit-banding only supports 1-bit accesses; if you're dealing with a field that's 6 bits wide, you have to go back to the read/modify/write scenario.
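A sketch of such an explicit read/modify/write helper for a multi-bit field (function and parameter names are illustrative; wrap the body in your own interrupt locking if an ISR can touch the same register):

```c
#include <stdint.h>

/* Read-modify-write update of a multi-bit register field.  `mask`
 * selects the field; `value` is the new field value already shifted
 * into position.  The three steps are explicit, which makes the
 * non-atomicity visible at a glance. */
static void reg_update_field(volatile uint32_t *reg,
                             uint32_t mask, uint32_t value)
{
    uint32_t tmp = *reg;       /* read   */
    tmp &= ~mask;              /* clear the field */
    tmp |= (value & mask);     /* modify */
    *reg = tmp;                /* write  */
}
```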

Whether bitfield operations are atomic depends entirely on the architecture and the compiler. My personal experience says: don't use bitfields if you don't have to.

I'm pretty sure that on powerpc this is not atomic, but if your target is a single core system then you can just:
void update_reg_from_isr(unsigned *reg_addr, unsigned set, unsigned clear, unsigned toggle) {
    unsigned reg = *reg_addr;
    reg |= set;
    reg &= ~clear;
    reg ^= toggle;
    *reg_addr = reg;
}

void update_reg(unsigned *reg_addr, unsigned set, unsigned clear, unsigned toggle) {
    interrupts_block();
    update_reg_from_isr(reg_addr, set, clear, toggle);
    interrupts_enable();
}
I don't remember whether PowerPC's interrupt handlers are interruptible, but if they are, you should always use the second version.
If your target is a multiprocessor system then you should make locks (spinlocks, which disable interrupts on the local processor and then wait for any other processors to finish with the lock) that protect access to things like hardware registers, and acquire the needed locks before you access the register, and then release the locks immediately after you have finished updating the register (or registers).
I once read how to implement locks on PowerPC -- it involved telling the processor to watch the memory bus for a certain address while you did some operations, then checking at the end of those operations whether the watched address had been written to by another core. If it hadn't, your operation was successful; if it had, you had to redo the operation. This was in a document written for compiler, library, and OS developers. I don't remember where I found it (probably somewhere on IBM.com), but a little hunting should turn it up. It probably also has info on how to do atomic bit twiddling.
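That reserve/conditional-store loop (lwarx/stwcx. on PowerPC) is what the GCC/Clang `__atomic` builtins compile down to, so on a GCC-compatible toolchain you can sketch the register update like this. Note that each builtin call is atomic on its own, but the three together are not one atomic transaction:

```c
#include <stdint.h>

/* Lock-free bit update using compiler builtins (GCC/Clang).  On
 * PowerPC each call expands to a lwarx/stwcx. retry loop; each step
 * is individually atomic, the sequence as a whole is not. */
static void atomic_reg_update(uint32_t *reg, uint32_t set,
                              uint32_t clear, uint32_t toggle)
{
    __atomic_fetch_and(reg, ~clear, __ATOMIC_SEQ_CST);
    __atomic_fetch_or(reg, set, __ATOMIC_SEQ_CST);
    __atomic_fetch_xor(reg, toggle, __ATOMIC_SEQ_CST);
}
```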

Related

C99 "atomic" load in baremetal portable library

I'm working on a portable library for baremetal embedded applications.
Assume that I have a timer ISR that increments a counter, and that in the main loop this counter is read with a load that is almost certainly not atomic.
I'm trying to ensure load consistency (i.e. that I'm not reading garbage because the load was interrupted and the value changed) without resorting to disabling interrupts. It does not matter if the value changed after reading the counter as long as the read value is proper. Does this do the trick?
uint32_t read(volatile uint32_t *var) {
    uint32_t value;
    do { value = *var; } while (value != *var);
    return value;
}
It's highly unlikely that there's any sort of a portable solution for this, not least because plenty of C-only platforms are really C-only and use one-off compilers, i.e. nothing mainstream and modern-standards-compliant like gcc or clang. So if you're truly targeting entrenched C, then it's all quite platform-specific and not portable - to the point where "C99" support is a lost cause. The best you can expect for portable C code is ANSI C support - referring to the very first non-draft C standard published by ANSI. That is still, unfortunately, the common denominator - that major vendors get away with. I mean: Zilog somehow gets away with it, even if they are now but a division of Littelfuse, formerly a division of IXYS Semiconductor that Littelfuse had acquired.
For example, here are some compilers where there's only a platform-specific way of doing it:
Zilog eZ8 using a "recent" Zilog C compiler (anything 20 years old or less is OK): 8-bit value read-modify-write is atomic. 16-bit operations where the compiler generates word-aligned word instructions like LDWX, INCW, DECW are atomic as well. If the read-modify-write otherwise fits into 3 instructions or less, you'd prepend the operation with asm("\tATM");. Otherwise, you'd need to disable the interrupts: asm("\tPUSHF\n\tDI");, and subsequently re-enable them: asm("\tPOPF");.
Zilog ZNEO is a 16 bit platform with 32-bit registers, and read-modify-write accesses on registers are atomic, but memory read-modify-write round-trips via a register, usually, and takes 3 instructions - thus prepend the R-M-W operation with asm("\tATM").
Zilog Z80 and eZ80 require wrapping the code in asm("\tDI") and asm("\tEI"), although this is valid only when it's known that the interrupts are always enabled when your code runs. If they may not be enabled, then there's a problem since Z80 does not allow reading the state of IFF1 - the interrupt enable flip-flop. So you'd need to save a "shadow" of its state somewhere, and use that value to conditionally enable interrupts. Unfortunately, eZ80 does not provide an interrupt controller register that would allow access to IEF1 (eZ80 uses the IEFn nomenclature instead of IFFn) - so this architectural oversight is carried over from the venerable Z80 to the "modern" one.
Those aren't necessarily the most popular platforms out there, and many people don't bother with Zilog compilers due to their fairly poor quality (low enough that yours truly had to write an eZ8-targeting compiler*). Yet such odd corners are the mainstay of C-only code bases, and library code has no choice but to accommodate this, if not directly then at least by providing macros that can be redefined with platform-specific magic.
E.g. you could provide empty-by-default macros MYLIB_BEGIN_ATOMIC(vector) and MYLIB_END_ATOMIC(vector) that would be used to wrap code that requires access atomic with respect to a given interrupt vector (or e.g. -1 if with respect to all interrupt vectors). Naturally, replace MYLIB_ with a "namespace" prefix specific to your library.
To enable platform-specific optimizations such as ATM vs DI on "modern" Zilog platforms, an additional argument could be provided to the macro to separate the presumed "short" sequences, for which the compiler is apt to generate three-instruction sequences, from longer ones. Such micro-optimization usually requires an assembly-output audit (easily automatable) to verify the assumption about instruction-sequence length, but at least the data to drive the decision would be available, and the user would have a choice of using it or ignoring it.
*If some lost soul wants to know anything bordering on the arcane re. eZ8 - ask away. I know entirely too much about that platform, in details so gory that even modern Hollywood CG and SFX would have a hard time reproducing the true depth of the experience on-screen. I'm also possibly the only one out there running the 20MHz eZ8 parts occasionally at 48MHz clock - as sure a sign of demonic possession as the multiverse allows. If you think it's outrageous that such depravity makes it into production hardware - I'm with you. Alas, business case is business case, laws of physics be damned.
Are you running on any systems that have a uint32_t larger than the machine's single-instruction word read/write size? If not, the load from memory should be a single instruction and therefore atomic (assuming the bus is also word-sized...). You get into trouble when the compiler breaks it up into multiple smaller reads/writes. Otherwise, I've always had to resort to DI/EI. You could have the user configure your library so that it knows whether atomic instructions or a minimum 32-bit word size are available, to avoid the interrupt twiddling. If you have those guarantees, you don't need the verification code.
To answer the question, though: on a system that must split the reads/writes, your code is not safe. Imagine a case where you read your value correctly in the "do" part, but the value gets split during the "while" check. Further, in an extreme case, this is an infinite loop. For complete safety you'd need a retry count and an error condition to prevent that. The loop case is extreme, for sure, but I'd want the guard just in case. That of course makes the run time longer.
Let's show a failure case by example - using 16-bit numbers on a machine that reads 8 bits at a time, to make it easier to follow:
Value to read from memory *var is 0x1234
Read 8-bit 0x12
*var becomes 0x5678
Read 8-bit 0x78 - value is now 0x1278 (invalid)
*var becomes 0x1234
Verification step reads 8-bit 0x12
*var becomes 0x5678
Verification reads 8-bit 0x78
Value confirmed "correct" as 0x1278, but this is an error, as *var was only ever 0x1234 or 0x5678.
Another failure case would be when *var just happens to change at the same frequency as your code is running, which could lead to an infinite loop as each verification fails. Or even if it did break out eventually, this would be a very hard to track performance bug.
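A bounded-retry variant along the lines suggested above might look like this (a sketch; the function name and error convention are my own):

```c
#include <stdint.h>

/* Consistent read with a retry limit.  Returns 0 and stores the value
 * on success, or -1 if two consecutive reads never matched within
 * `max_tries` attempts -- turning the potential infinite loop into a
 * detectable error. */
static int read_consistent(volatile uint32_t *var, uint32_t *out,
                           int max_tries)
{
    while (max_tries-- > 0) {
        uint32_t v = *var;
        if (v == *var) {    /* second read matches: accept the value */
            *out = v;
            return 0;
        }
    }
    return -1;              /* value never read back stable */
}
```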

How to find issues related to Data consistency in an Embedded C code base?

Let me explain what I mean by data consistency issue. Take following scenario for example
uint16 x, y;
x = 0x01FF;
y = x;
Clearly, these variables are 16 bits wide, but if an 8-bit CPU runs this code, the read and write operations are not atomic. An interrupt can therefore occur in between and change the value. This is one situation which MIGHT lead to data inconsistency.
Here's another example,
if (x > 7) // x is a global variable
{
    switch (x)
    {
    case 8: // do something
        break;
    case 10: // do something
        break;
    default: // do default
    }
}
In the above code excerpt, if an interrupt changes the value of x from 8 to 5 after the if statement but before the switch statement, we end up in the default case instead of case 8.
Please note, I'm looking for ways to detect such scenarios (but not solutions)
Are there any tools that can detect such issues in Embedded C?
It is possible for a static analysis tool that is context (thread/interrupt) aware to determine the use of shared data, and such a tool could recognise specific mechanisms protecting that data (or the lack thereof).
One such tool is Polyspace Code Prover; it is very expensive and very complex, and does a lot more besides what is described above. Specifically, to quote (elided) from the whitepaper here:
With abstract interpretation the following program elements are interpreted in new ways:
[...]
Any global shared data may change at any time in a multitask program, except when protection
mechanisms, such as memory locks or critical sections, have been applied
[...]
It may have improved in the long time since I used it, but one issue I had was that it worked on a lock-access-unlock idiom, where you specified to the tool what the lock/unlock calls or macros were. The problem with that is that the C++ project I worked on used a smarter method where a locking object (mutex, scheduler-lock or interrupt disable for example) locked on instantiation (in the constructor) and unlocked in the destructor so that it unlocked automatically when the object went out of scope (a lock by scope idiom). This meant that the unlock was implicit and invisible to Polyspace. It could however at least identify all the shared data.
Another issue with the tool is that you must specify all thread and interrupt entry points for concurrency analysis, and in my case these were private-virtual functions in task and interrupt classes, again making them invisible to Polyspace. This was solved by conditionally making the entry-points public for the abstract analysis only, but meant that the code being tested does not have the exact semantics of the code to be run.
Of course these are non-problems for C code, and in my experience Polyspace is much more successfully applied to C in any case; you are far less likely to be writing code in a style to suit the tool rather than the tool working with your existing code-base.
There are no such tools as far as I am aware. And that is probably because you can't detect them.
Pretty much every operation in your C code has the potential to get interrupted before it is finished. Less obvious than the 16 bit scenario, there is also this:
uint8_t a, b;
...
a = b;
There are no guarantees that this is atomic! The above assignment may well translate to multiple assembler instructions, such as 1) load b into a register, 2) store the register at a's memory address. You can't know unless you disassemble the generated code and check.
This can create very subtle bugs. Any assumption of the kind "as long as I use 8-bit variables on my 8-bit CPU, I can't get interrupted" is naive. Even if such code results in atomic operations on a given CPU, the code is non-portable.
The only reliable, fully portable solution is to use some manner of semaphore. On embedded systems, this could be as simple as a bool flag variable. Another solution is inline assembler, but that can't be ported across platforms.
To solve this, C11 introduced the _Atomic qualifier to the language. However, C11 support among embedded-systems compilers is still mediocre.
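Where C11 is available, the ISR counter scenario from the question can be written with `_Atomic` so that neither the increment nor the main-loop load can be torn (a sketch; `shared_counter` and the function names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* With _Atomic, the load and the increment are guaranteed indivisible
 * (lock-free support permitting on the target). */
static _Atomic uint16_t shared_counter;

void isr_increment(void)        /* would run in the timer ISR */
{
    atomic_fetch_add(&shared_counter, 1);
}

uint16_t main_loop_read(void)   /* safe single load in the main loop */
{
    return atomic_load(&shared_counter);
}
```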

Alternative to writing masks for 32 bit microcontrollers

I am working on a project that involves programming 32-bit ARM microcontrollers. As in much embedded software work, setting and clearing bits is an essential and quite repetitive task. A masking strategy is workable, but with 32-bit registers it is not really practical to write out a mask each time we need to set or clear a single bit.
Writing functions to handle this could be a solution; however, a function occupies memory, which is not ideal in my case.
Is there a better alternative for setting and clearing bits when working with 32-bit micros?
In C or C++, you would typically define macros for bit masks and combine them as desired.
/* widget.h */
#define WIDGET_FOO 0x00000001u
#define WIDGET_BAR 0x00000002u
/* widget_driver.c */
static uint32_t *widget_control_register = (uint32_t *)0x12346578;

int widget_init(void) {
    *widget_control_register |= WIDGET_FOO;
    if (*widget_control_register & WIDGET_BAR) log(LOG_DEBUG, "widget: bar is set");
    return 0;
}
If you want to define the bit masks from the bit positions rather than as absolute values, define constants based on a shift operation (if your compiler doesn't optimize these constants, it's hopeless).
#define WIDGET_FOO (1u << 0)
#define WIDGET_BAR (1u << 1)
You can define macros to set bits:
/* widget.h */
#define WIDGET_CONTROL_REGISTER_ADDRESS ((uint32_t*)0x12346578)
#define SET_WIDGET_BITS(m) (*WIDGET_CONTROL_REGISTER_ADDRESS |= (m))
#define CLEAR_WIDGET_BITS(m) (*WIDGET_CONTROL_REGISTER_ADDRESS &= ~(uint32_t)(m))
You can define functions rather than macros. This has the advantage of added type verification during compilation. If you declare the function as static inline (or even just static) in a header, a good compiler will inline the function everywhere, so using a function in your source code won't cost any code memory (assuming that the generated code for the function body is smaller than a function call, which should be the case for a function that merely sets some bits in a register).
/* widget.h */
#define WIDGET_CONTROL_REGISTER_ADDRESS ((uint32_t*)0x12346578)
static inline void set_widget_bits(uint32_t m) {
    *WIDGET_CONTROL_REGISTER_ADDRESS |= m;
}
static inline void clear_widget_bits(uint32_t m) {
    *WIDGET_CONTROL_REGISTER_ADDRESS &= ~m;
}
The other common idiom for registers providing access to individual bits or groups of bits is to define a struct containing bitfields for each register of your device. This can get tricky, and it is dependent on the C compiler implementation. But it can also be clearer than macros.
A simple device with a one-byte data register, a control register, and a status register could look like this:
typedef struct {
    unsigned char data;
    unsigned char txrdy:1;
    unsigned char rxrdy:1;
    unsigned char reserved:2;
    unsigned char mode:4;
} COMCHANNEL;
#define CHANNEL_A (*(COMCHANNEL *)0x10000100)
// ...
void sendbyte(unsigned char b) {
    while (!CHANNEL_A.txrdy) /* spin */;
    CHANNEL_A.data = b;
}

unsigned char readbyte(void) {
    while (!CHANNEL_A.rxrdy) /* spin */;
    return CHANNEL_A.data;
}
Access to the mode field is just CHANNEL_A.mode = 3;, which is a lot clearer than writing something like *CHANNEL_A_MODE = (*CHANNEL_A_MODE & ~CHANNEL_A_MODE_MASK) | (3 << CHANNEL_A_MODE_SHIFT);. Of course, the latter ugly expression would usually be (mostly) covered over by macros.
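For instance, the ugly expression can be hidden behind generic mask/shift macros along these lines (the macro names here are illustrative, not from any vendor header):

```c
#include <stdint.h>

/* Generic field accessors hiding the mask/shift expression above. */
#define FIELD_SET(reg, mask, shift, val) \
    ((reg) = ((reg) & ~(mask)) | (((uint32_t)(val) << (shift)) & (mask)))
#define FIELD_GET(reg, mask, shift) (((reg) & (mask)) >> (shift))

/* Example field definition for a 4-bit mode field in bits 7..4. */
#define CHANNEL_A_MODE_MASK  0xF0u
#define CHANNEL_A_MODE_SHIFT 4
```

Setting the mode then reads `FIELD_SET(reg, CHANNEL_A_MODE_MASK, CHANNEL_A_MODE_SHIFT, 3)`, which keeps the read-modify-write explicit while staying reasonably terse.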
In my experience, once you have established a style for describing your peripheral registers, you are best served by following that style over the whole project. The consistency will have huge benefits for future code maintenance, and over the lifetime of a project that factor is likely more important than the relatively small detail of whether you adopted the struct-and-bitfields or the macro style.
If you are coding for a target which has already set a style in its manufacturer provided header files and customary compiler toolchain, adopting that style for your own custom hardware and low level code may be best as it will provide the best match between manufacturer documentation and your coding style.
But if you have the luxury of establishing the style for your development at the outset, your compiler platform is well enough documented to permit you to reliably describe device registers with bitfields, and you expect to use the same compiler for the lifetime of the product, then that is often a good way to go.
You can actually have it both ways too. It isn't that unusual to wrap the bitfield declarations inside a union that describes the physical registers, allowing their values to be easily manipulated all bits at once. (I know I've seen a variation of this where conditional compilation was used to provide two versions of the bitfields, one for each bit order, and a common header file used toolchain-specific definitions to decide which to select.)
typedef struct {
    unsigned char data;
    union {
        struct {
            unsigned char txrdy:1;
            unsigned char rxrdy:1;
            unsigned char reserved:2;
            unsigned char mode:4;
        } bits;
        unsigned char status;
    };
} COMCHANNEL;
// ...
#define CHANNEL_A_MODE_TXRDY 0x01
#define CHANNEL_A_MODE_RXRDY 0x02
#define CHANNEL_A_MODE_MASK 0xf0
#define CHANNEL_A_MODE_SHIFT 4
// ...
#define CHANNEL_A (*(COMCHANNEL *)0x10000100)
// ...
void sendbyte(unsigned char b) {
    while (!CHANNEL_A.bits.txrdy) /* spin */;
    CHANNEL_A.data = b;
}

unsigned char readbyte(void) {
    while (!CHANNEL_A.bits.rxrdy) /* spin */;
    return CHANNEL_A.data;
}
Assuming your compiler understands the anonymous union, you can simply refer to CHANNEL_A.status to get the whole byte, or CHANNEL_A.bits.mode to refer to just the mode field.
There are some things to watch for if you go this route. First, you have to have a good understanding of structure packing as defined in your platform. The related issue is the order in which bit fields are allocated across their storage, which can vary. I've assumed that the low order bit is assigned first in my examples here.
There may also be hardware implementation issues to worry about. If a particular register must always be read and written 32 bits at a time, but you have it described as a bunch of small bit fields, the compiler might generate code that violates that rule and accesses only a single byte of the register. There is usually a trick available to prevent this, but it will be highly platform dependent. In this case, using macros with a fixed sized register will be less likely to cause a strange interaction with your hardware device.
These issues are very compiler vendor dependent. Even without changing compiler vendors, #pragma settings, command line options, or more likely optimization level choices can all affect memory layout, padding, and memory access patterns. As a side effect, they will likely lock your project to a single specific compiler toolchain, unless heroic efforts are used to create register definition header files that use conditional compilation to describe the registers differently for different compilers. And even then, you are probably well served to include at least one regression test that verifies your assumptions so that any upgrades to the toolchain (or well-intentioned tweaks to the optimization level) will cause any issues to get caught before they are mysterious bugs in code that "has worked for years".
The good news is that the sorts of deep embedded projects where this technique makes sense are already subject to a number of toolchain lock in forces, and this burden may not be a burden at all. Even if your product development team moves to a new compiler for the next product, it is often critical that firmware for a particular product be maintained with the very same toolchain over its lifetime.
If you use the Cortex M3 you can use bit-banding
Bit-banding maps a complete word of memory onto a single bit in the bit-band region. For example, writing to one of the alias words will set or clear the corresponding bit in the bitband region.
This allows every individual bit in the bit-banding region to be directly accessible from a word-aligned address using a single LDR instruction. It also allows individual bits to be toggled from C without performing a read-modify-write sequence of instructions.
If you have C++ available, and there's a decent compiler available, then something like QFlags is a good idea. It gives you a type-safe interface to bit flags.
It is likely to produce better code than using bitfields in structures, since the bitfields can only be changed one at a time and will likely translate to at least one load/modify/store per each changed bitfield. With a QFlags-like approach, you can get one load/modify/store per each or-assign or and-assign statement. Note that the use of QFlags doesn't require the inclusion of the entire Qt framework. It's a stand-alone header file (after minor tweaks).
At the driver level setting and clearing bits with masks is very common and sometimes the only way. Besides, it's an extremely quick operation; only a few instructions. It may be worthwhile to set up a function that can clear or set certain bits for readability and also reusability.
It's not clear what type of registers you are setting and clearing bits in but in general there are two cases you have to worry about in embedded systems:
Setting and clearing bits in a read/write register
If you want to change a single bit (or a handful of bits) in a read and write register you will first have to read the register, set or clear the appropriate bit using masks and whatever else to get the correct behavior, and then write back to the same register. That way you don't change the other bits.
Writing to separate Set and Clear registers (common in ARM micros)
Sometimes there are separate Set and Clear registers. You can write just a single bit to a Clear register and it will clear that bit. For instance, if you want to clear bit 9 of a register, just write (1<<9) to the corresponding Clear register. You don't have to worry about modifying the other bits. Similarly for the Set register.
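A sketch of that idiom, modeled with a plain struct so it can run off-target (the layout is made up; on real hardware the set and clear registers would be distinct write-only addresses in the peripheral's register map):

```c
#include <stdint.h>

/* Hypothetical GPIO block with write-1-to-set / write-1-to-clear
 * semantics.  A write touches only the bits that are 1 in the mask,
 * so no read-modify-write of the output state is needed. */
typedef struct {
    uint32_t out;  /* output state the "hardware" maintains */
} fake_gpio_t;

static void gpio_write_set(fake_gpio_t *g, uint32_t mask) { g->out |= mask; }
static void gpio_write_clr(fake_gpio_t *g, uint32_t mask) { g->out &= ~mask; }
```

With this, clearing bit 9 is just `gpio_write_clr(&gpio, 1u << 9);` and cannot disturb any other bit.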
You can set and clear bits with function-like macros that take up no more memory than doing it with a mask directly:
#define SET_BIT(variableName, bitNumber) ((variableName) |= (0x00000001u << (bitNumber)))
#define CLR_BIT(variableName, bitNumber) ((variableName) &= ~(0x00000001u << (bitNumber)))

int myVariable = 12;
SET_BIT(myVariable, 0); // myVariable now equals 13
CLR_BIT(myVariable, 2); // myVariable now equals 9
These macros will produce exactly the same assembler instructions as a mask.
Alternatively, you could do this:
#define BIT(n) (0x00000001u << (n))
#define NOT_BIT(n) (~(0x00000001u << (n)))

int myVariable = 12;
myVariable |= BIT(4);     // myVariable now equals 28
myVariable &= NOT_BIT(3); // myVariable now equals 20
myVariable |= BIT(5) |
              BIT(6) |
              BIT(7) |
              BIT(8);     // myVariable now equals 500

AVR 8bit, C standard compliance regarding bit accessing of SFRs

One of my colleagues ran into some strange problems programming an ATmega, related to accessing input/output ports.
After some research into the problem I concluded that we should avoid accessing SFRs using operations which may compile to SBI or CBI instructions, if we aim for safe, C-standard-compliant software. I am asking whether this decision was justified, i.e. whether my concerns here are valid.
The datasheet of the Atmel processor is here, it's an ATMega16. I will refer to some pages of this document below.
I will refer to the C standard using the version found on this site under the WG14 N1256 link.
The SBI and CBI instructions of the processor operate at the bit level, accessing only the bit in question. So they are not true Read-Modify-Write (R-M-W) instructions since, as I understand it, they do not perform a read (of the targeted 8-bit SFR).
On page 50 of the above datasheet the first sentence begins All AVR ports have true Read-Modify-Write functionality..., while further on it specifies that this only applies to accesses with the SBI and CBI instructions, which technically are not R-M-W. The datasheet does not define what reading, for example, the PORTx registers is supposed to return (it does, however, indicate that they are readable). So I assumed that reading these SFRs is undefined (they might return the last thing written to them, or the current input state, or whatever).
On page 70 it lists some external interrupt flags, which is interesting because this is where the nature of the SBI and CBI instructions becomes important. The flags are set when an interrupt occurs, and they may be cleared by writing a one to them. So if SBI were a true R-M-W instruction, it would clear all three flags regardless of the bit specified in the opcode.
And now let's get into the matters of C.
The compiler itself is truly irrelevant; the only important fact is that it might use the CBI and SBI instructions in certain situations, which I think makes it non-compliant.
In the above mentioned C99 standard, the section 5.1.2.3 Program execution, point 2 and 3 refers to this (on page 13), and 6.7.3 Type qualifiers, point 6 (on page 109). The latter mentions that What constitutes an access to an object that has volatile-qualified type is implementation-defined, however a few phrases before it requires that any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine.
Also note that hardware ports such as that used in the example are declared volatile in the appropriate headers.
Example:
PORTA |= 1U << 6;
This is known to translate to an SBI. This implies that only a Write access happens on the volatile (PORTA) object. However if one would write:
var = 6;
...
PORTA |= 1U << var;
That would not translate to an SBI even though it still sets only one bit (SBI has the bit to set encoded in the opcode). So this will expand to a true R-M-W sequence with a potentially different result than above (in the case of PORTA this is undefined behaviour, as far as I could deduce from the datasheet).
By the C standard this behaviour might or might not be permitted. It is messy in that two things happen here and mix together. One, the more apparent, is the lack of the Read access in one of the cases. The other, less apparent, is how the Write is performed.
If the compiled code omits the Read, it might fail to trigger hardware behaviour which is tied to such an access. However the AVR as far as I know has no such mechanism, so it might pass by the standard.
The Write is more interesting, however it also takes in the Read.
Omitting the Read when using SBI implies that the affected SFRs must all work like latches (or any bit not working like that is tied to 0 or 1), so the compiler can be sure what it would read from them if it actually did the access. If this were not the case, the compiler would at least be buggy. By the way, this also clashes with the fact that the datasheet does not define what is read from the PORTx registers.
How the Write is performed is also a source of inconsistency: the result differs depending on how the compiler compiles it (a CBI or SBI affecting only one bit, versus a byte write affecting all bits). So code written to clear or set one bit might either "work" (as in not "accidentally" clearing interrupt flags), or not, if the compiler produces a true R-M-W sequence instead.
Maybe these are technically permitted by the C standard (as "implementation defined" behaviour, and the compiler deducting these cases that the Read access is not necessary to the volatile object), but at least I would consider it a buggy or inconsistent implementation.
Another example:
PORTA = PORTA | (1U << 6);
Normally, to conform with the standard, a read and then a write of PORTA should be carried out. If this compiles to an SBI, the read access is missing, although as above this may pass as a mix of implementation-defined behaviour and the compiler deducing that the read is unnecessary here. (Or was my assumption wrong? That is, assuming a |= b is identical to a = a | b?)
Based on this, I settled on avoiding these types of code, as it is (or may become) unclear whether they compile to an SBI or CBI or to a true read-modify-write sequence.
To tell the truth, I mostly resolved this by reading various forum posts rather than analysing actual compiler output; it is not my project, after all (and right now I am not at work). From reading AVRFreaks, for example, I accepted that AVR-GCC would emit these instructions in the situations above, which alone may pose a problem even if we would not observe it with the compiler version we actually used. (In any case, my suggestion to implement port accesses through shadow work variables fixed the problems my colleague observed.)
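The shadow-variable approach mentioned above can be sketched roughly as follows. This is an illustrative sketch, not AVR device-header code: the register and function names are made up, and a plain variable stands in for the memory-mapped SFR you would normally get from <avr/io.h>.

```c
#include <stdint.h>

/* Illustrative stand-in for a memory-mapped port register; on real
 * hardware this would be the SFR address from the device header. */
static volatile uint8_t PORTA_reg;

/* Shadow copy of the intended port state, kept in ordinary RAM. */
static uint8_t porta_shadow;

/* All modifications go through the shadow; the hardware register is
 * only ever written, never read back, so the generated code cannot
 * depend on what a read of the register would return. */
static void port_set_bit(uint8_t bit)
{
    porta_shadow |= (uint8_t)(1U << bit);
    PORTA_reg = porta_shadow;
}

static void port_clear_bit(uint8_t bit)
{
    porta_shadow &= (uint8_t)~(1U << bit);
    PORTA_reg = porta_shadow;
}
```

Note that the shadow update itself is still a read-modify-write of the RAM variable, so if interrupt handlers touch the same port, the update must be wrapped in an interrupt-disabled section just as with direct register access.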
Note: I edited the middle based on some research on the C (C99) standard.
Edit: Reading the AVR Libc FAQ I again found something which contradicts the automatic use of SBI or CBI. It is the last question and answer, where it specifically states that since the ports are declared volatile, the compiler cannot optimize out the read access, according to the rules of the C language (as it puts it).
I also understand that it is very unlikely that this particular behaviour (using SBI or CBI) would directly introduce bugs, but by masking "bugs" it may introduce very nasty ones in the long run if someone accidentally generalizes from this behaviour without understanding the AVR at assembly level.
You should probably stop trying to apply the C memory model to I/O registers. They are not plain memory. In the case of PORTn registers, it is in fact irrelevant whether it is a single-bit write or a R-M-W operation unless interrupts are involved. If you do a read-modify-write, an interrupt may alter state in between, causing a race condition; but that would be exactly the same issue for plain memory. The advantage of the SBI/CBI instructions there is that they are atomic.
The PORTn registers are readable, and also drive the output buffers. They are not different functions on read and write (as on PIC), but a normal register. Newer PICs also have the output registers readable on LAT addresses, precisely so you won't need a shadow variable. Other SFRs such as PINn or interrupt flags have more complicated behaviour. On recent AVRs, writing to PINn instead toggles bits in PORTn, which again is useful for its fast and atomic operation. Writing 1s to interrupt flag registers clears them, again to prevent race conditions.
The point is, these features are in place to produce correct behaviour for hardware-aware programs, even if some of it looks odd in C code (e.g. using reg = _BV(2); instead of reg &= ~_BV(2);). Precise compliance with the C standard is an impractical goal when the code is by its very nature hardware specific (though semantic similarity does help, which the interrupt flag behaviour fails at). Wrapping the odd constructs in inline functions or macros with names that explain what they truly do is probably a good idea, or at least commenting what the effects are. A set of such I/O routines could also form the basis of a hardware abstraction layer that may help you port code.
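As a sketch of such a wrapper, here is the write-1-to-clear interrupt-flag case hidden behind a function whose name says what actually happens. The register and its semantics are simulated with an ordinary variable plus a helper that models the hardware's write behaviour; all names here are hypothetical, not from any real device header.

```c
#include <stdint.h>

/* Simulated write-1-to-clear flag register: the "hardware" clears any
 * bit that receives a 1 and leaves 0-bits alone. */
static uint8_t tifr_flags = 0xFF;          /* all flags pending */

static void tifr_write(uint8_t value)      /* what a store to the SFR does */
{
    tifr_flags &= (uint8_t)~value;         /* 1s clear, 0s are untouched */
}

/* The wrapper writes exactly one 1-bit: reg = _BV(bit). The tempting
 * reg &= ~_BV(bit) would read the register and write 1s back to every
 * other pending flag, clearing them all. */
static void clear_interrupt_flag(uint8_t bit)
{
    tifr_write((uint8_t)(1U << bit));
}
```

The simulation makes the trap concrete: a plain C assignment performs a clear, which is exactly the kind of semantic mismatch worth hiding behind a well-named function.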
Trying to interpret the C specification strictly here is also rather confusing, as it doesn't admit to addressing bits (which is what SBI and CBI do), and digging through my old (1992) copy finds that volatile accesses may result in several implementation defined behaviours, including the possibility of no accesses at all.

How to guarantee 64-bit writes are atomic?

When can 64-bit writes be guaranteed to be atomic, when programming in C on an Intel x86-based platform (in particular, an Intel-based Mac running MacOSX 10.4 using the Intel compiler)? For example:
unsigned long long int y;
y = 0xfedcba87654321ULL;
/* ... a bunch of other time-consuming stuff happens... */
y = 0x12345678abcdefULL;
If another thread is examining the value of y after the first assignment to y has finished executing, I would like to ensure that it sees either the value 0xfedcba87654321 or the value 0x12345678abcdef, and not some blend of them. I would like to do this without any locking, and if possible without any extra code. My hope is that, when using a 64-bit compiler (the 64-bit Intel compiler), on an operating system capable of supporting 64-bit code (MacOSX 10.4), that these 64-bit writes will be atomic. Is this always true?
Your best bet is to avoid trying to build your own system out of primitives, and instead use locking unless it really shows up as a hot spot when profiling. (If you think you can be clever and avoid locks, don't. You aren't. That's the general "you" which includes me and everybody else.) You should at minimum use a spin lock, see spinlock(3). And whatever you do, don't try to implement "your own" locks. You will get it wrong.
Ultimately, you need to use whatever locking or atomic operations your operating system provides. Getting these sorts of things exactly right in all cases is extremely difficult. Often it can involve knowledge of things like the errata for specific versions of specific processor. ("Oh, version 2.0 of that processor didn't do the cache-coherency snooping at the right time, it's fixed in version 2.0.1 but on 2.0 you need to insert a NOP.") Just slapping a volatile keyword on a variable in C is almost always insufficient.
On Mac OS X, that means you need to use the functions listed in atomic(3) to perform truly atomic-across-all-CPUs operations on 32-bit, 64-bit, and pointer-sized quantities. (Use the latter for any atomic operations on pointers so you're 32/64-bit compatible automatically.) That goes whether you want to do things like atomic compare-and-swap, increment/decrement, spin locking, or stack/queue management. Fortunately the spinlock(3), atomic(3), and barrier(3) functions should all work correctly on all CPUs that are supported by Mac OS X.
On x86_64, both the Intel compiler and gcc support some intrinsic atomic-operation functions. Here's gcc's documentation of them: http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html
The Intel compiler docs also talk about them here: http://softwarecommunity.intel.com/isn/downloads/softwareproducts/pdfs/347603.pdf (page 164 or thereabouts).
According to Chapter 7 of Part 3A - System Programming Guide of Intel's processor manuals, quadword accesses will be carried out atomically if aligned on a 64-bit boundary, on a Pentium or newer, and unaligned (if still within a cache line) on a P6 or newer. You should use volatile to ensure that the compiler doesn't try to cache the write in a variable, and you may need to use a memory fence routine to ensure that the write happens in the proper order.
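A minimal sketch of that recipe, assuming a GCC-compatible compiler on x86-64 (the `__sync_synchronize` full fence and the `aligned` attribute are GCC/Clang extensions; the variable and function names are illustrative):

```c
#include <stdint.h>

/* 64-bit slot, explicitly aligned so a single MOV stores it atomically
 * on x86-64 (and on 32-bit P6+ while it stays within one cache line).
 * volatile keeps the compiler from caching the value in a register. */
static volatile uint64_t shared_value __attribute__((aligned(8)));

static void publish(uint64_t v)
{
    shared_value = v;        /* one aligned 64-bit store */
    __sync_synchronize();    /* full fence: order it against later writes */
}

static uint64_t observe(void)
{
    return shared_value;     /* one aligned 64-bit load */
}
```

This only covers plain stores and loads of independent values; anything that derives the new value from the old one still needs the interlocked operations discussed below.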
If you need to base the value written on an existing value, you should use your operating system's Interlocked features (e.g. Windows has InterlockedIncrement64).
On Intel MacOSX, you can use the built-in system atomic operations. There isn't a provided atomic get or set for either 32 or 64 bit integers, but you can build that out of the provided CompareAndSwap. You may wish to search XCode documentation for the various OSAtomic functions. I've written the 64-bit version below. The 32-bit version can be done with similarly named functions.
#include <libkern/OSAtomic.h>
#include <stdbool.h>
// bool OSAtomicCompareAndSwap64Barrier(int64_t oldValue, int64_t newValue, volatile int64_t *theValue);
void AtomicSet(uint64_t *target, uint64_t new_value)
{
    while (true)
    {
        uint64_t old_value = *target;
        if (OSAtomicCompareAndSwap64Barrier(old_value, new_value, (int64_t *)target)) return;
    }
}
uint64_t AtomicGet(uint64_t *target)
{
    while (true)
    {
        uint64_t value = *target;
        if (OSAtomicCompareAndSwap64Barrier(value, value, (int64_t *)target)) return value;
    }
}
Note that Apple's OSAtomicCompareAndSwap functions atomically perform the operation:
if (*theValue != oldValue) return false;
*theValue = newValue;
return true;
We use this in the example above to create a Set method by first grabbing the old value, then attempting to swap the target memory's value. If the swap succeeds, that indicates that the memory's value is still the old value at the time of the swap, and it is given the new value during the swap (which itself is atomic), so we are done. If it doesn't succeed, then some other thread has interfered by modifying the value in-between when we grabbed it and when we tried to reset it. If that happens, we can simply loop and try again with only minimal penalty.
The idea behind the Get method is that we can first grab the value (which may or may not be the actual value, if another thread is interfering). We can then try to swap the value with itself, simply to check that the initial grab was equal to the atomic value.
I haven't checked this against my compiler, so please excuse any typos.
You mentioned OSX specifically, but in case you need to work on other platforms, Windows has a number of Interlocked* functions, and you can search the MSDN documentation for them. Some of them work on Windows 2000 Pro and later, and some (particularly some of the 64-bit functions) are new with Vista. On other platforms, GCC versions 4.1 and later have a variety of __sync* functions, such as __sync_fetch_and_add(). For other systems, you may need to use assembly, and you can find some implementations in the SVN browser for the HaikuOS project, inside src/system/libroot/os/arch.
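On GCC 4.1+ (and compilers emulating it, such as Clang and ICC), the __sync family mentioned above can be used directly; a small sketch, with illustrative names, might look like this:

```c
#include <stdint.h>

/* Hypothetical shared counter updated from several threads. */
static volatile int64_t counter;

static int64_t bump(int64_t delta)
{
    /* Atomic read-modify-write with a full barrier; returns the new value. */
    return __sync_add_and_fetch(&counter, delta);
}

static int swap_if_equal(int64_t expected, int64_t desired)
{
    /* Returns nonzero iff counter still held 'expected' and was swapped. */
    return __sync_bool_compare_and_swap(&counter, expected, desired);
}
```

All of these intrinsics imply a full memory barrier, which is often more ordering than you need but is at least never too little.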
On x86, the fastest way to atomically write an aligned 64-bit value is to use FISTP. For unaligned values, you need a CAS2 (_InterlockedExchange64). The CAS2 operation is quite slow due to the bus lock, though, so it can often be faster to check alignment and use the FISTP version for aligned addresses. Indeed, this is how Intel Threading Building Blocks implements atomic 64-bit writes.
The latest version of ISO C (C11) defines a set of atomic operations, including atomic_store(_explicit). See e.g. this page for more information.
The second most portable implementation of atomics are the GCC intrinsics, which have already been mentioned. I find that they are fully supported by GCC, Clang, Intel, and IBM compilers, and - as of the last time I checked - partially supported by the Cray compilers.
One clear advantage of C11 atomics - in addition to the whole ISO standard thing - is that they support a more precise memory consistency prescription. The GCC atomics imply a full memory barrier as far as I know.
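With a C11 compiler (<stdatomic.h>), the original example can be written so that the stores and loads are guaranteed atomic, with an explicitly chosen memory order; the names below are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

/* C11 atomic 64-bit object: atomic_store/atomic_load on it are atomic
 * regardless of what the underlying hardware guarantees for plain stores. */
static _Atomic uint64_t y;

static void writer(void)
{
    /* Release store: earlier writes become visible before the new value. */
    atomic_store_explicit(&y, 0x12345678abcdefULL, memory_order_release);
}

static uint64_t reader(void)
{
    /* Acquire load: pairs with the release store above. */
    return atomic_load_explicit(&y, memory_order_acquire);
}
```

Choosing release/acquire rather than the default sequential consistency is exactly the "more precise memory consistency prescription" that the legacy GCC intrinsics cannot express.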
If you want to do something like this for interthread or interprocess communication, then you need to have more than just an atomic read/write guarantee. In your example, it appears that you want the values written to indicate that some work is in progress and/or has been completed. You will need to do several things, not all of which are portable, to ensure that the compiler has done things in the order you want them done (the volatile keyword may help to a certain extent) and that memory is consistent. Modern processors and caches can perform work out of order unbeknownst to the compiler, so you really need some platform support (ie., locks or platform-specific interlocked APIs) to do what it appears you want to do.
"Memory fence" or "memory barrier" are terms you may want to research.
GCC has intrinsics for atomic operations; I suspect other compilers offer something similar. Never rely on the compiler to keep an operation atomic by accident: optimization runs the risk of turning even obviously atomic operations into non-atomic ones unless you explicitly tell the compiler not to.