Is unaligned access in Cortex-M4 atomic? - arm

The ARM documentation mentions that
The Cortex-M4 processor supports ARMv7 unaligned accesses, and
performs all accesses as single, unaligned accesses. They are
converted into two or more aligned accesses by the DCode and System
bus interfaces.
It's not clear to me if this means the data access is atomic to the programmer or not. Then I found a StackOverflow comment interpreting the documentation as:
Actually some ARM processors like the Cortex-M3 support unaligned
access in HW, so even an unaligned read/write is atomic. The access
may span multiple bus cycles to memory, but there is no opportunity
for another instruction to jump in between, so it is atomic to the
programmer.
However, I looked around some more and found claims that contradict the previous one:
Another one is the fact that on cores beginning ARMv6 and later, in
order for the hardware to “fix-up” an unaligned access, it splits it
up into multiple smaller byte loads. However, these are not atomic!
So, who do I believe? For some context, I have setters/getters for each element of a packed struct in my project. In other words, some struct elements may be unaligned. I was wondering whether accessing those struct elements is always guaranteed to be atomic on the Cortex-M4. If it's not, I'll have to disable/enable interrupts manually or add a mutex, but I'd rather not if the Cortex-M4 can simply guarantee that the data accesses are atomic.

Nope, it isn't.
See section A3.5.3 of the ARMv7-M Architecture Reference Manual:
In ARMv7-M, the single-copy atomic processor accesses are:
All byte accesses.
All halfword accesses to halfword-aligned locations.
All word accesses to word-aligned locations.
So, if you are copying a uint32 that isn't aligned to a 32-bit boundary (which is allowed in v7-M), the copy isn't atomic.
Also quoting:
When an access is not single-copy atomic, it is executed as a sequence of smaller accesses, each of which is single-copy atomic, at least at the byte level.
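If those packed-struct accessors are shared with interrupt handlers, masking interrupts around the unaligned access (as the asker suggests) is one option. A minimal sketch, assuming GCC inline assembly on a Cortex-M device; the struct layout and names are invented for illustration:

#include <stdint.h>

struct __attribute__((packed)) sensor_frame {
    uint8_t  id;         /* offset 0 */
    uint32_t timestamp;  /* offset 1: unaligned, so the bus splits the access */
};

/* Save PRIMASK and disable interrupts (CPSID i). */
static inline uint32_t irq_lock(void)
{
    uint32_t primask;
    __asm volatile ("mrs %0, primask" : "=r" (primask));
    __asm volatile ("cpsid i" ::: "memory");
    return primask;
}

/* Restore the saved PRIMASK (re-enables interrupts only if they were enabled before). */
static inline void irq_unlock(uint32_t primask)
{
    __asm volatile ("msr primask, %0" :: "r" (primask) : "memory");
}

/* Getter: the unaligned read executes as several aligned accesses, so it is
 * bracketed by an interrupt mask to keep an ISR from observing or causing a
 * torn value. */
static inline uint32_t frame_get_timestamp(const struct sensor_frame *f)
{
    uint32_t primask = irq_lock();
    uint32_t ts = f->timestamp;
    irq_unlock(primask);
    return ts;
}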

Related

Alignment requirement for variables protected using gcc's __sync_xxx functions on ARM

My understanding of atomic operations on any HW platform is that they require aligned memory addresses to provide atomic operations. On x86, an entire cache line is locked to implement CAS, making the alignment requirement a cache line. On the ARM that I'm using, the ERG is set to 4, implying a 64-byte Exclusive Reservation Granule. However, in the code base that we have, I have found __sync_fetch_and_add() calls where the variable being operated on has no alignment requirements in its declaration.
Does anyone have experience with this on ARM? Should I align my atomic variables to the ERG boundary before using the __sync functions? The GCC documentation makes no mention of alignment.
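For what it's worth, here is a minimal sketch of the pattern in question (GCC on ARM assumed; the variable name is invented). The builtin relies on the natural alignment of the int, which the ABI already provides; padding out to the ERG is generally a false-sharing/performance consideration rather than a correctness one:

#include <stdint.h>

/* Hypothetical shared counter. The aligned(4) attribute only restates what the
 * ABI already guarantees for an int; it is here to make the intent explicit. */
static int shared_count __attribute__((aligned(4)));

void bump(void)
{
    /* Compiles to an LDREX/STREX retry loop on ARMv7. */
    (void)__sync_fetch_and_add(&shared_count, 1);
}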

What's the purpose of glib's g_atomic_int_get?

glib provides a g_atomic_int_get function to atomically read a standard C int type. Isn't reading 32-bit integers from memory into registers already guaranteed to be an atomic operation by the processor (e.g. mov <reg32>, <mem>)?
If yes, then what's the purpose of glib's g_atomic_int_get function?
Some processors allow reading unaligned data, but that may take more than a single memory access, i.e. it's no longer atomic. On others it might not be an atomic operation at all to begin with.
The x86 mov instruction is not always atomic, either: it is non-atomic if the addresses involved are not naturally aligned.
Even if it were always atomic, it is not a memory barrier, which means the compiler is free to reorder the instruction with reference to other instructions nearby; and the processor is free to reorder the instruction with reference to other instructions in the instruction stream at runtime.
Unless you are writing code targeting only a single platform (and are sure that code will never need to be ported to another platform), you must always use explicit atomic instructions if you want atomic guarantees.
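As a sketch of what "explicit atomic instructions" can look like in portable code (C11 atomics here; glib's g_atomic_int_get/g_atomic_int_set serve the same purpose), with illustrative names:

#include <stdatomic.h>

static _Atomic int shared_value;

int read_shared(void)
{
    /* Sequentially consistent load: atomic and ordered on every platform. */
    return atomic_load(&shared_value);
}

void write_shared(int v)
{
    /* Sequentially consistent store. */
    atomic_store(&shared_value, v);
}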

Why aren't structs packed by default?

While reading the CERT C Coding Standard, I came across DCL39-C which discusses why it's generally a bad idea for something like the Linux kernel to return an unpacked struct to userspace due to information leakage.
In a nutshell, structs aren't generally packed by default and the padding bytes between members of a struct often contain uninitialized data, hence the information leakage.
Why aren't structs packed by default? There was a mention in the guide that it's an optimization feature of compilers for specific architectures, I believe. Why is aligning struct members to certain byte boundaries more efficient, even though it wastes memory space?
Also, why doesn't the C standard specify a standardized way of asking for a packed struct? I can ask GCC using __attribute__((packed)), and there are other ways for different compilers, but it seems like a feature that'd be nice to have as part of the standard.
Data is carried through electronic circuits by groups of parallel wires (buses). Likewise, the circuits themselves tend to be arrayed in parallel. The physical distance between parallel components adds resistance and capacitance to any crosswise wires that bridge them. So, such bridges tend to be expensive and slow, and computer architects avoid them when possible.
Unaligned loads require shifting bytes across lanes. Some CPUs (e.g. efficiency-oriented RISC) are physically incapable of doing this, because the bridge component doesn't exist. Some will detect the condition and interpose a lane shift at the expense of a cycle or two. Others can handle misalignment without a speed penalty… assuming paged memory doesn't add another problem.
There's another, completely different issue. The memory management unit (MMU) sits between the CPU execution core and the memory bus, translating program-visible logical addresses into physical addresses. Two adjacent logical addresses might live on different physical pages. The MMU is optimized for the common case where an access requires only one translation, but a misaligned access that straddles a page boundary may require two.
A misaligned access straddling a page boundary might incur an exception, which might be fatal inside a kernel. Since pages are relatively large, this condition is relatively rare. It might evade tests, and it may be non-deterministic.
TL;DR: Packed structures shouldn't be used for active program state, especially not in a kernel. They may be used for serialization and communication, but safe usage is another question.
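For the serialization case, one common way to stay safe (a sketch, with invented names) is never to take the address of an unaligned member and instead memcpy the bytes into an aligned local:

#include <stdint.h>
#include <string.h>

/* Read a possibly-unaligned 32-bit field out of a received buffer. The
 * compiler emits whatever access sequence is legal for the target, and the
 * result lands in a properly aligned local. Byte order is left as-is. */
static inline uint32_t read_u32(const uint8_t *buf, size_t offset)
{
    uint32_t v;
    memcpy(&v, buf + offset, sizeof v);
    return v;
}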
Leaving structs "unpacked" allows the compiler to align all members so that operations are more efficient on those members (measured in terms of clock time, number of instructions, etc). The alignment requirement for types depends on the host architecture and (for struct types) on the alignment requirement of contained members.
Packing struct members forces some (if not all) members to be aligned in a way that is sub-optimal for performance. In some worst cases - depending on host architecture - an operation on an unaligned variable (or on an unaligned struct member) triggers a processor fault condition. Some RISC processor architectures, for example, generate an alignment fault when a load or store operation targets an unaligned address. Some SSE instructions on recent x86 architectures require the data they act on to be aligned on 16-byte boundaries.
In the best cases, the operations behave as intended but less efficiently, due to the overhead of copying an unaligned variable to an aligned location or to a register, doing the operation there, and copying it back. Those copying operations are less efficient when unaligned variables are involved - after all, the processor architecture is optimised for performance when variable alignment meets its design requirements.
If you are worried about data leaking out of your program, simply use functions like memset() to overwrite the contents of your structures at the end of their lifetime (e.g. just before an instance is about to pass out of scope, or immediately before dynamically allocated memory is deallocated using free()).
Or use an operating system (like OpenBSD) which does overwrite memory before making it available to processes or programs. Bear in mind that such features tend to make both the operating system and programs it hosts run less efficiently.
Recent versions of the C standard (since 2011) do have some facilities to query and control alignment of variables (and affect packing of struct members). The default is whatever alignment is most effective for the host architecture - which for struct types normally means unpacked.
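To make the padding concrete, here is a small sketch (the exact numbers are implementation-defined; the comments show typical values for an ABI with 4-byte int alignment), using GCC's packed attribute and the C11 alignof query mentioned above:

#include <stdio.h>
#include <stddef.h>
#include <stdalign.h>

struct padded_example {
    char c;   /* offset 0 */
    int  i;   /* typically offset 4: three padding bytes follow c */
};

struct __attribute__((packed)) packed_example {
    char c;   /* offset 0 */
    int  i;   /* offset 1: unaligned on most ABIs */
};

int main(void)
{
    printf("alignof(int) = %zu\n", alignof(int));
    printf("padded: sizeof = %zu, offsetof(i) = %zu\n",
           sizeof(struct padded_example), offsetof(struct padded_example, i));
    printf("packed: sizeof = %zu, offsetof(i) = %zu\n",
           sizeof(struct packed_example), offsetof(struct packed_example, i));
    return 0;
}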
On some compilers such as Microchip XC8, all structs are indeed always packed.
On some platforms, compilers will only generate byte access instructions to access members of a packed struct, because byte access instructions are always aligned. If all structs are packed, the 16-, 32-, and 64-bit load/store instructions are not used. This is an obvious waste of resources.
The C standard does not specify a way of packing structs, possibly because the standard itself has no concept of packing: the layout of non-bit-field members of a struct is implementation-defined, and therefore out of scope for the standard. Or possibly the standard is written to support architectures that must always add padding in structs, since such architectures are at least theoretically feasible.

Store/load atomicity on ARM Cortex-A9 MPCore

Is it safe to assume that assigning and accessing 32-bit integers on ARM Cortex-A9 MPCore implementations are atomic operations, and that the assigned value is synchronized with all cores? Will the C compiler guarantee that
uint32_t *p;
*p = 4711;
and
uint32_t *p;
return *p;
are translated to atomic operations in assembler?
"Atomic" and "synchronized with all cores" are different requirements. All ARM cores in the market implement 32 bit operations to memory atomically (which is to say you can never see "part" of the word written without the rest). Not all of them are cache-coherent between cores, and the details (especially with the more exotic configurations like big.LITTLE) are complicated.
Use your OS synchronization primitives. That complexity is what they are designed to abstract.
No; that is what strex/ldrex are for. Within a single core a normal str and ldr are fine, but to coordinate access with the other cores you have to use strex/ldrex (and have a memory system that supports them).
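A sketch of the practical takeaway (GCC/Clang builtins assumed; the counter name is invented): let the compiler emit the LDREX/STREX loop and the required barriers for you rather than writing them by hand:

#include <stdint.h>

static uint32_t shared_counter;

/* On ARMv7-A this compiles to an LDREX/STREX retry loop plus the barriers
 * implied by the memory order, which is what makes a read-modify-write safe
 * across coherent cores. */
uint32_t counter_inc(void)
{
    return __atomic_add_fetch(&shared_counter, 1, __ATOMIC_SEQ_CST);
}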

Read and Write atomic operation implementation in the Linux Kernel

Recently I've peeked into the Linux kernel implementation of an atomic read and write and a few questions came up.
First the relevant code from the ia64 architecture:
typedef struct {
    int counter;
} atomic_t;

#define atomic_read(v)      (*(volatile int *)&(v)->counter)
#define atomic64_read(v)    (*(volatile long *)&(v)->counter)
#define atomic_set(v,i)     (((v)->counter) = (i))
#define atomic64_set(v,i)   (((v)->counter) = (i))
For both read and write operations, it seems that the direct approach was taken to read from or write to the variable. Unless there is another trick somewhere, I do not understand what guarantees that this operation will be atomic at the assembly level. I guess an obvious answer would be that such an operation translates to a single assembly opcode, but even so, how is that guaranteed once the different memory cache levels (or other optimizations) are taken into account?
In the read macros, the volatile qualifier is used in a cast. Does anyone have a clue how this affects the atomicity here? (Note that it is not used in the write operation.)
I think you are misunderstanding the (rather vague) usage of the words "atomic" and "volatile" here. Atomic only really means that the word will be read or written atomically (in one step, guaranteeing that the contents of this memory location will always be one write or the other, and never something in between). The volatile keyword tells the compiler never to assume it already knows the data at that location because of an earlier read/write (basically, never optimize away the read).
What the words "atomic" and "volatile" do NOT mean here is that there's any form of memory synchronization. Neither implies ANY read/write barriers or fences. Nothing is guaranteed with regards to memory and cache coherence. These functions are basically atomic only at the software level, and the hardware can optimize/lie however it deems fit.
Now as to why simply reading is enough: the memory models for each architecture are different. Many architectures can guarantee atomic reads or writes for data aligned to a certain byte offset, or up to x words in length, etc., and this varies from CPU to CPU. The Linux kernel contains many defines for the different architectures that let it do without heavier atomic instructions (CMPXCHG, basically) on platforms that guarantee atomic reads/writes (sometimes only in practice, even when their spec says they don't actually guarantee it).
As for the volatile, while there is no need for it in general unless you're accessing memory-mapped IO, it all depends on when/where/why the atomic_read and atomic_write macros are being called. Many compilers will (though it is not required by the C spec) generate memory barriers/fences for volatile variables (GCC, off the top of my head, is one; MSVC does for sure). While that would normally mean that all reads/writes of the variable are exempt from just about any compiler optimization, in this case, by casting through a volatile pointer to create a "virtual" volatile variable, only this particular read/write is off-limits for optimization and reordering.
Reads are atomic on most major architectures, as long as they are aligned to a multiple of their size (and aren't bigger than the natural read size of the machine); see the Intel Architecture manuals. Writes, on the other hand, may be different: Intel states that under x86, single-byte writes and aligned writes are atomic, while under IPF (IA-64) everything uses acquire and release semantics, which makes it guaranteed atomic.
The volatile prevents the compiler from caching the value locally, forcing it to be retrieved from memory wherever it is accessed.
If you write for a specific architecture, you can make assumptions specific to it.
I guess IA-64 does compile these things to a single instruction.
The cache shouldn't be an issue, unless the counter crosses a cache line boundary. But if 4/8-byte alignment is required, this can't happen.
A "real" atomic instruction is required when a machine instruction translates into two memory accesses. This is the case for increments (read, increment, write) or compare&swap.
volatile affects the optimizations the compiler can do.
For example, it prevents the compiler from converting multiple reads into one read.
But on the machine instruction level, it does nothing.
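A small sketch of that last point (names invented): with volatile the compiler must perform a fresh load on every iteration; without it, it could legally hoist a single read out of the loop. Nothing about ordering or the atomicity of larger operations changes.

/* Flag set from an interrupt handler or another thread. */
static volatile int ready_flag;

void wait_until_ready(void)
{
    /* Each iteration re-reads ready_flag because of the volatile qualifier. */
    while (ready_flag == 0) {
        /* spin */
    }
}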
