How do memory fences work?

I need to understand memory fences in multicore machines. Say I have this code
Core 1
mov [_x], 1; mov r1, [_y]
Core 2
mov [_y], 1; mov r2, [_x]
Now the unexpected result without memory fences would be that both r1 and r2 can be 0 after execution. In my opinion, to counter that problem, we should put a memory fence on both cores, as putting it on only one would still not solve the problem. Something like the following...
Core 1
mov [_x], 1; memory_fence; mov r1, [_y]
Core 2
mov [_y], 1; memory_fence; mov r2, [_x]
Is my understanding correct, or am I still missing something? Assume the architecture is x86. Also, can someone tell me how to put memory fences in C++ code?

Fences serialize the operations that they fence (loads & stores): no later operation may start until the fence executes, and the fence will not execute until all preceding operations have completed. Quoting Intel makes the meaning of this a little more precise (taken from the MFENCE entry, page 3-628, Vol. 2A, Intel Instruction Set Reference):
This serializing operation guarantees that every load and store
instruction that precedes the MFENCE instruction in program order
becomes globally visible before any load or store instruction that
follows the MFENCE instruction.
A load instruction is considered to become globally visible when
the value to be loaded into its destination register is determined.
Using fences in C++ is tricky (C++11 may have fence semantics somewhere, maybe someone else has info on that), as it is platform and compiler dependent. For x86 using MSVC or ICC, you can use the _mm_lfence, _mm_sfence & _mm_mfence intrinsics for load, store and load + store fencing (note that some of these are SSE2 instructions).
Note: this assumes an Intel perspective, that is, an x86 (32- or 64-bit) or IA-64 processor.
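As a rough sketch of how the questioner's sequence might look with those intrinsics (assuming MSVC or ICC on x86; the volatile int globals here merely stand in for the shared _x and _y, and core1/core2 are illustrative names):

#include <emmintrin.h>   // _mm_mfence (SSE2)

volatile int _x = 0, _y = 0;

int core1(void)          // runs on core 1
{
    _x = 1;              // mov [_x], 1
    _mm_mfence();        // the store above is globally visible before the load below
    return _y;           // mov r1, [_y]
}

int core2(void)          // runs on core 2
{
    _y = 1;
    _mm_mfence();
    return _x;
}

volatile only stops the compiler from caching or eliding the accesses; the fence is what constrains the ordering the hardware may produce.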

C++11 (ISO/IEC 14882:2011) defines a multi-threading-aware memory model.
Although I don't know of any compiler that currently implements the new memory model, C++ Concurrency in Action by Anthony Williams documents it very well. You may check Chapter 5 - The C++ Memory Model and Operations on Atomic Types, where he explains relaxed operations and memory fences. He is also the author of the just::thread library, which may be used until we have compiler vendor support for the new standard.
Anthony Williams is also the maintainer of the boost::thread library.
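For reference, a minimal sketch of the same two-core example using the C++11 facilities that chapter describes (std::atomic plus std::atomic_thread_fence; the function names are illustrative):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void core1()
{
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);   // full fence
    r1 = y.load(std::memory_order_relaxed);
}

void core2()
{
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

int main()
{
    std::thread t1(core1), t2(core2);
    t1.join(); t2.join();
    // With both fences in place, r1 == 0 && r2 == 0 is no longer a permitted outcome.
}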

Related

Why is there no inbuilt swap function in C but there is xchg in Assembly?

Recently I came across assembly language. x86 assembly has an xchg instruction which swaps the contents of two registers.
Since C code is first converted to assembly, it would have been nice if there were a built-in swap function in C, like in the header stdio.h. Then whenever the compiler detected a call to swap, it could emit the xchg instruction in the assembly file.
So why was this swap function not implemented in C?
C is a cross-platform language. Assembly is architecture specific. Not every architecture has such an instruction. Moreover, C, as a high-level language, doesn't have to correspond to the machine-level instruction set and features, as its purpose is to bridge between the "human" language and the machine language, not to mimic it. That said, a C compiler for this specific architecture might have an extension for this swapping instruction, or might optimize the swapping code to use this instruction if it is smart enough.
There are two points which explain why swap() is not in C:
1. Function call semantics:
Including a swap() function would break a very fundamental design decision in C: swap() can only work with pass-by-reference semantics (which C++ added to the language, but which are absent in C), not with pass-by-value (see the pointer-based sketch after these two points).
2. Diversity of available assembler instructions
Apart from that, there is usually quite a number of assembler instructions on any given CPU architecture which are totally inaccessible from pure C. This includes instructions as diverse as interrupt handling instructions, virtual memory space manipulating instructions, I/O instructions, bit fiddling instructions (google the PPC instruction rlwimi for an especially powerful example of this), etc.
It is simply impossible to include any significant number of these in a general purpose language like C.
Some of these are crucial for implementing operating systems, which is why any OS must include at the very least some small amounts of assembler code. They are usually encapsulated in some functions with inline assembler or defined in the kernel headers as preprocessor directives. Other instructions are less important, or only good for optimizations, these may be generated by optimizing compilers, and many compilers do generate them (the whole class of vector functions fall in this category).
In the face of this vast diversity, the designers of C just had to cut it somewhere. And they opted for providing whatever is representable as simple operators (+, -, ~, &, |, !, &&, ||, etc.), but did not provide anything that would require function call syntax like the swap() function you propose.
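To make point 1 above concrete, a hypothetical C swap has to be handed addresses, because the callee only ever sees copies of its arguments (swap_int is an illustrative name, not anything standard):

#include <stdio.h>

/* C has no references, so a swap "function" must take pointers. */
static void swap_int(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

int main(void)
{
    int x = 1, y = 2;
    swap_int(&x, &y);          /* swap(x, y) by value could never affect x and y */
    printf("%d %d\n", x, y);   /* prints: 2 1 */
    return 0;
}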
That would work only for variables that fit in a register and are already in registers. It would not work for large structs or variables held in memory. (And if you load variable A into register X and another, say B, into register Y and then swap them, you could skip the swapping and load A into Y and B into X directly.)
Having said that, nothing prevents the compiler for a given architecture from using the swap instruction to compile:
int a;
int b;
int tmp;
tmp = a;
a = b;
b = tmp;
... if those happen to be in registers: the fact that it is not in C does not mean the compiler does not use it.
Besides what the other correct answers say, another part of your premise is wrong.
Only a really dumb compiler would want to actually emit xchg every time the source swapped variables, whether there's an intrinsic or operator for it or not. Optimizing compilers don't just transliterate C into asm, they typically convert to an SSA internal representation of the program logic, and optimize that so they can implement it with as few instructions as possible (or really in the most efficient way possible; using multiple fast instructions can be better than a single slower one).
xchg is rarely faster than 3 mov instructions, and a good compiler can simply change its local-variable <-> CPU register mapping without emitting any asm instructions in many cases. (Or inside a loop, unrolling can often optimize away swapping.) Often you need only 1 or 2 mov instructions in asm, not all 3. e.g. if only one of the C vars being swapped needs to stay in the same register, you can do:
# start: x in EAX, y in ECX
mov edx, eax
mov eax, ecx
# end: y in EAX, x in EDX
See also Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?
Also note that xchg [mem], reg is atomic (implicit lock prefix), and thus is a full memory barrier, and much slower than 3 mov instructions, and with much higher impact on surrounding code because of the memory-barrier effect.
If you do actually need to exchange registers, 3x mov is pretty good. Often better than xchg reg,reg because of mov elimination, at the cost of more code-size and a tmp reg.
There's a reason compilers never use xchg. If xchg were a win, compilers would look for it as a peephole optimization the same way they look for inc eax over add eax,1, or xor eax,eax instead of mov eax,0. But they don't.
(semi-related: swapping 2 registers in 8086 assembly language(16 bits))
Even though xchg is a very elementary instruction, this doesn't mean C must have its equivalent. The fact that C sometimes maps directly to assembly is not very relevant; the standard says nothing about "assembly" (why map to assembly and not another low-level language?).
You might also ask: why does C not have built-in vector instructions? They're becoming widely available!
There's also the compiler's help: swapping variables is a very visible pattern, so such an optimization shouldn't be hard to implement. And you also have inline asm, should you need it.
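For the inline-asm escape hatch, here is a hedged sketch of forcing an xchg with GNU C extended asm (GCC/Clang on x86; as the other answer argues, you normally would not want this, and swap_u32 is an illustrative name):

/* "+r" makes the compiler load *a and *b into registers, exchange them,
   and write both values back. */
static inline void swap_u32(unsigned *a, unsigned *b)
{
    __asm__("xchg %0, %1" : "+r"(*a), "+r"(*b));
}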

How to prevent LDM/STM instructions expansion in ARM Compiler 5 armcc inline assembler?

I'm trying to generate AXI bus burst accesses using STM/LDM instructions in inline assembly in .c file compiled with ARM Compiler 5 armcc.
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
__asm {
STMIA addr!, { w0, w1 }
}
}
But ARM Compiler armcc User Guide, paragraph 7.18 is saying:
"All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."
And that is what really happens in practice: LDM/STM are expanded into a set of LDR/STR in some cases, and the order of these instructions is arbitrary.
This affects performance, since the HW we use is optimized for burst processing. It also breaks functional correctness, because the HW we use takes the sequence of words into consideration and ignores offsets (but the compiler thinks it's safe to change the order of instructions).
To resolve this it's possible to use the embedded assembler instead of the inline assembler, but this leads to extra function calls/returns, which affects performance.
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
Target CPU: Cortex M0+ (ARMv6-M).
Edit:
Slave devices are all on-chip devices, most of them non-memory devices. For every register of a non-memory slave that supports burst access, a region of the address space is reserved (for example [0x10000..0x10100]); I'm not completely sure why, maybe the CPU or bus doesn't support fixed (non-incrementing) addresses. The HW ignores offsets within this region. A full request can be 16 bytes, for example, and the first word of the full request is the first word written (even if the offset is non-zero).
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
A little bit about compiler optimizations. Register allocation is one of a compiler's toughest jobs. The heart of any compiler's code generation is probably around when it allocates physical CPU registers. Most compilers use static single assignment (SSA) to rename your 'C' variables into a bunch of pseudo variables (or time-ordered variables).
In order for your STMIA and LDMIA to work you need the loads and stores to be consistent. I.e., if it is stmia [rx], {r3,r7} and a restore like ldmia [rx], {r4,r8}, the stored 'r3' must map to the new 'r4' and the stored 'r7' to the restored 'r8'. This is not simple for any compiler to implement generically, as 'C' variables are assigned to registers according to need. Different versions of the same variable may be in different registers. To make the stm/ldm work, those variables must be assigned so that the register numbers increase in the right order. I.e., for the ldmia above, if the compiler wants the stored r7 in r0 (maybe a return value?), there is no way for it to create a good ldm instruction without generating additional code.
You may have gotten gcc to generate this, but it was probably luck. If you proceed with only gcc, you will probably find it doesn't work as well.
See: ldm/stm and gcc for issues with GCC stm/ldm.
Taking your example,
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
__asm {
STMIA addr!, { w0, w1 }
}
}
The value of inline is that the whole function body may be put right in the code. The caller might have w0 and w1 in registers R8 and R4. If the function is not inlined, then the compiler must place them in R1 and R2, and may have to generate extra moves. It is difficult for any compiler to fulfil the requirements of the ldm/stm generically.
This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).
If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the functionality to write to this slave in an external wrapper and force the register allocation (see AAPCS) so that ldm/stm will work. This will result in a performance hit which could be mitigated by some custom assembler in the driver for the device.
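A hedged sketch of such a wrapper with armcc's embedded assembler (the function name is illustrative; per the AAPCS the arguments arrive in R0-R2, so the order of the STMIA register list is fixed by construction, at the cost of a real call/return as the question notes):

#include <stdint.h>

/* Embedded assembler: armcc compiles this as a real (non-inline) function,
   so calling it costs a branch/return pair, but the STMIA is emitted as written. */
__asm void burst_write2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    STMIA r0!, {r1, r2}   ; addr in r0, w0 in r1, w1 in r2 (AAPCS)
    BX    lr
}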
However, it sounds like the device might be memory? In this case, you have a problem. Normally, memory devices like this would only be accessed through a cache. If your CPU has an MPU (memory protection unit) and can enable both data and code caches, then you might resolve this issue. Cache line fills and evictions will always be burst accesses. Care only needs to be taken in the code to set up the MPU and the data cache. The OP's Cortex-M0+ has no cache and the devices are non-memory, so this will not be possible (nor needed).
If your device is memory and you have no data cache, then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit, losing the benefits of the random access of the memory device.

ARM Program Counter distinguishing feature

How does the R15 of ARM differ from the general PC of a CPU?
Both of them are program counters only. What is the difference?
ARM's PC is closer to being a regular register (with some restrictions) than x86's instruction pointer is.
Assuming the "general PC" you have in mind is that of an Intel x86-based CPU: in x86's case you can't manipulate the PC (instruction pointer) directly; it is updated implicitly by the provided control-flow instructions.
In ARM's case, historically, the Program Counter (PC), mapped as the register at index 15 (the 16th register), can be manipulated directly via arithmetic instructions. For example, you can add 16 to the PC, which would alter the flow of the instruction stream similarly to a 16-byte forward jump instruction.
The ARM PC may be more of a general register than on most CPUs, but it is still very special. The traditional simple arithmetic instructions can use the PC as an input argument in many cases. Here it functions as a pointer or array base. It can also be used as the output for control transfer with these instructions. As a read-only value, it is useful for calculating return addresses in a PC-independent way. It is also useful for constant table look-ups in nearby code. For these cases, the PC is very much like a regular register. This is probably more common on many RISC CPUs than on a CISC ISA.
However, when the PC is used as a destination (lvalue, or updated and written), the behavior is often non-standard. Some examples of special cases (for some ARM architecture versions) for R15/PC are:
adcs - copies SPSR to CPSR
adds - copies SPSR to CPSR
ands - copies SPSR to CPSR
bics - copies SPSR to CPSR
bx r15 - highly discouraged or not supported.
clz r15 - not supported.
mcr pXX, xx, r15,... - unpredictable
etc.
In most cases, using the PC as a destination of an instruction will have some special case. In particular, the S suffix (normally used to set condition codes) can be used to return from an exception. This might be used as some sort of veneer when returning from an exception, or just as a direct return. In some cases, the meaning of the instruction changes completely. For instance, ldm sp, {r0-r15}^ and ldm sp, {r0-r14}^ use different register banks; the first will load the registers according to the mode in the SPSR, whereas the second will load the registers into user mode.
For load/store, atomics, mode manipulation, co-processor and complex arithmetic (64 bit multiplies, etc) instructions, the PC is often unsupported or has a different meaning; the different meaning is often a mechanism for handling exceptions for system level code.

How come INC instruction of x86 is not atomic? [duplicate]

This question already has answers here:
Can num++ be atomic for 'int num'?
I've read that the INC instruction of x86 is not atomic. My question is, how come? Suppose we are incrementing a 64-bit integer on x86-64; we can do it with one instruction, since the INC instruction works with both memory operands and registers. So how come it's not atomic?
Why would it be? The processor core still needs to read the value stored at the memory location, calculate the increment of it, and then store it back. There's a latency between reading and storing, and in the meantime another operation could have affected that memory location.
Even with out-of-order execution, processor cores are 'smart' enough not to trip over their own instructions and wouldn't be responsible for modifying this memory in the time gap. However, another core could have issued an instruction that modifies that location, a DMA transfer could have affected that location, or other hardware touched that memory location somehow.
Modern x86 processors, as part of their execution pipeline, "compile" x86 instructions into a lower-level set of operations; Intel calls these uOps, AMD rOps, but what it boils down to is that certain types of single x86 instructions get executed by the actual functional units in the CPU as several steps.
That means, for example, that:
INC EAX
gets executed as a single "mini-op" like uOp.inc eax (let me call it that - they're not exposed).
For other operands things look different, e.g.:
INC DWORD PTR [ EAX ]
the low-level decomposition though would look more like:
uOp.load tmp_reg, [ EAX ]
uOp.inc tmp_reg
uOp.store [ EAX ], tmp_reg
and therefore is not executed atomically. If on the other hand you prefix by saying LOCK INC [ EAX ], that'll tell the "compile" stage of the pipeline to decompose in a different way in order to ensure the atomicity requirement is met.
The reason for this is, of course, as mentioned by others: speed. Why make something atomic, and necessarily slower, if it is not always required?
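When you do need it, in C++ the portable way to ask for that locked increment is an atomic fetch-add; a minimal sketch (on x86-64 this typically compiles to lock add, or lock xadd if the old value is used):

#include <atomic>

std::atomic<long> counter{0};

void bump()
{
    counter.fetch_add(1, std::memory_order_seq_cst);   // atomic read-modify-write
}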
You really don't want a guaranteed atomic operation unless you need it, from Agner Fog's Software optimization resources: instruction_tables.pdf (1996 – 2017):
Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

Compare and swap in machine code in C

How would you write a function in C which does an atomic compare-and-swap on an integer value, using embedded machine code (assuming, say, x86 architecture)? Can it be any more specific if it's written only for the i7 processor?
Does the operation act as a memory fence, or does it just ensure an ordering relation on the memory location involved in the compare-and-swap? How costly is it compared to a memory fence?
Thank you.
The easiest way to do it is probably with a compiler intrinsic like _InterlockedCompareExchange(). It looks like a function but is actually a special case in the compiler that boils down to a single machine op. In the case of the MSVC x86 intrinsic, that works as a read/write fence as well, but that's not necessarily true on other platforms. (For example, on the PowerPC, you'd need to explicitly issue a lwsync to fence memory reordering.)
In general, on many common systems, a compare-and-swap operation usually only enforces an atomic transaction upon the one address it's touching. Other memory access can be reordered, and in multicore systems, memory addresses other than the one you've swapped may not be coherent between the cores.
You can use the CMPXCHG instruction with the LOCK prefix for atomic execution.
E.g.
lock cmpxchg DWORD PTR [ebx], edx
or
lock cmpxchgl %edx, (%ebx)
This compares the value in the EAX register with the value at the address stored in the EBX register and stores the value in the EDX register to that location if they are the same, otherwise it loads the value at the address stored in the EBX register into EAX.
You need to have a 486 or later for this instruction to be available.
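Wrapped in C, a hedged sketch with GNU extended asm looks roughly like this (GCC/Clang on x86; in practice you would rather use __sync_val_compare_and_swap or C11/C++11 atomic_compare_exchange_strong, and cas32 is an illustrative name):

/* Returns the value that was at *ptr; the swap happened iff it equals 'expected'. */
static inline int cas32(volatile int *ptr, int expected, int desired)
{
    int old;
    __asm__ __volatile__("lock; cmpxchg %2, %1"
                         : "=a"(old), "+m"(*ptr)
                         : "r"(desired), "0"(expected)
                         : "memory", "cc");
    return old;
}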
If your integer value is 64-bit, then use cmpxchg8b, the 8-byte compare-and-exchange, under IA-32 x86.
The variable must be 8-byte aligned.
Example:
mov eax, OldDataA //load Old first 32 bits
mov edx, OldDataB //load Old second 32 bits
mov ebx, NewDataA //load first 32 bits
mov ecx, NewDataB //load second 32 bits
mov edi, Destination //load destination pointer
lock cmpxchg8b qword ptr [edi]
setz al //if the transfer is successful, al is 1, else 0
If the LOCK prefix is omitted from atomic processor instructions, atomic operation in a multiprocessor environment will not be guaranteed.
In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted. Intel Instruction Set Reference
Without the LOCK prefix, the operation is only guaranteed not to be interrupted by any event (interrupt) on the current processor/core.
It's interesting to note that some processors don't provide a compare-exchange, but instead provide some other instructions ("Load Linked" and "Conditional Store") that can be used to synthesize the unfortunately-named compare-and-swap (the name sounds like it should be similar to "compare-exchange" but should really be called "compare-and-store" since it does the comparison, stores if the value matches, and indicates whether the value matched and the store was performed). The instructions cannot synthesize compare-exchange semantics (which provides the value that was read in case the compare failed), but may in some cases avoid the ABA problem which is present with Compare-Exchange. Many algorithms are described in terms of "CAS" operations because they can be used on both styles of CPU.
A "Load Linked" instruction tells the processor to read a memory location and watch in some way to see if it might be written. A "Conditional Store" instruction instructs the processor to write a memory location only if nothing can have written it since the last "Load Linked" operation. Note that the determination may be pessimistic; processing an interrupt, for example, may invalidate a "Load-Linked"/"Conditional Store" sequence. Likewise in a multi-processor system, an LL/CS sequence may be invalidated by another CPU accessing to a location on the same cache line as the location being watched, even if the actual location being watched wasn't touched. In typical usage, LL/CS are used very close together, with a retry loop, so that erroneous invalidations may slow things down a little but won't cause much trouble.

Resources