Compare and swap in machine code in C

How would you write a function in C which does an atomic compare and swap on an integer value, using embedded machine code (assuming, say, the x86 architecture)? Can it be made any more specific if it's written only for the i7 processor?
Does the instruction act as a memory fence, or does it only ensure ordering on the memory location involved in the compare and swap? How costly is it compared to a memory fence?
Thank you.

The easiest way to do it is probably with a compiler intrinsic like _InterlockedCompareExchange(). It looks like a function but is actually a special case in the compiler that boils down to a single machine op. In the case of the MSVC x86 intrinsic, that works as a read/write fence as well, but that's not necessarily true on other platforms. (For example, on the PowerPC, you'd need to explicitly issue a lwsync to fence memory reordering.)
On many common systems, a compare-and-swap operation usually only enforces atomicity on the one address it touches. Other memory accesses can be reordered, and in multicore systems, memory addresses other than the one you've swapped may not be coherent between the cores.
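As a concrete starting point, here is a minimal sketch using the GCC/Clang __atomic builtin (an assumption about your toolchain; on MSVC you would call _InterlockedCompareExchange instead). With __ATOMIC_SEQ_CST ordering, the operation also acts as a full fence on x86:

#include <stdbool.h>
#include <stdint.h>

/* Atomically: if (*ptr == expected) { *ptr = desired; return true; } else return false. */
static bool cas_int32(int32_t *ptr, int32_t expected, int32_t desired)
{
    return __atomic_compare_exchange_n(ptr, &expected, desired,
                                       false,             /* strong CAS, no spurious failure */
                                       __ATOMIC_SEQ_CST,  /* ordering on success */
                                       __ATOMIC_SEQ_CST); /* ordering on failure */
}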

You can use the CMPXCHG instruction with the LOCK prefix for atomic execution.
E.g.
lock cmpxchg DWORD PTR [ebx], edx
or
lock cmpxchgl %edx, (%ebx)
This compares the value in the EAX register with the value at the address held in EBX; if they are equal, it stores the value in EDX to that location, otherwise it loads the value at that address into EAX.
You need to have a 486 or later for this instruction to be available.
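A sketch of how this might be wrapped in a C function with GCC-style inline assembly (AT&T syntax, so the operand order is reversed relative to the Intel-syntax example above); this assumes a GCC-compatible compiler:

#include <stdint.h>

/* Returns the value that was in *ptr before the operation.
 * The swap succeeded if the return value equals old_val. */
static int32_t cas32(volatile int32_t *ptr, int32_t old_val, int32_t new_val)
{
    int32_t prev;
    __asm__ __volatile__("lock; cmpxchgl %2, %1"
                         : "=a"(prev), "+m"(*ptr)       /* EAX receives the previous value */
                         : "r"(new_val), "0"(old_val)   /* old_val is placed in EAX        */
                         : "memory");                   /* also acts as a compiler barrier */
    return prev;
}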

If your integer value is 64-bit, then use cmpxchg8b (8-byte compare and exchange) on 32-bit x86 (IA-32).
The variable must be 8-byte aligned.
Example:
mov eax, OldDataA        //old value, low 32 bits (compared against the destination)
mov edx, OldDataB        //old value, high 32 bits
mov ebx, NewDataA        //new value, low 32 bits
mov ecx, NewDataB        //new value, high 32 bits
mov edi, Destination     //destination pointer
lock cmpxchg8b qword ptr [edi]
setz al                  //AL is set to 1 if the exchange succeeded, 0 otherwise
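Writing cmpxchg8b by hand from C is awkward in inline assembly (on i386, EBX often doubles as the PIC register), so it is usually easier to let the compiler emit it. A sketch assuming GCC/Clang, where the legacy __sync builtin below typically compiles to lock cmpxchg8b on 32-bit targets that support that instruction:

#include <stdbool.h>
#include <stdint.h>

/* 64-bit CAS; as noted above, the variable should be 8-byte aligned. */
static bool cas64(uint64_t *ptr, uint64_t old_val, uint64_t new_val)
{
    /* true if *ptr was equal to old_val and has been replaced by new_val */
    return __sync_bool_compare_and_swap(ptr, old_val, new_val);
}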

If the LOCK prefix is omitted, atomicity across a multiprocessor environment is not guaranteed.
In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted. (Intel Instruction Set Reference)
Without the LOCK prefix, the operation is only guaranteed not to be interrupted by any event (interrupt) on the current processor/core.

It's interesting to note that some processors don't provide a compare-exchange, but instead provide some other instructions ("Load Linked" and "Conditional Store") that can be used to synthesize the unfortunately-named compare-and-swap (the name sounds like it should be similar to "compare-exchange", but it should really be called "compare-and-store", since it does the comparison, stores if the value matches, and indicates whether the value matched and the store was performed). The instructions cannot synthesize compare-exchange semantics (which provide the value that was read in case the compare failed), but may in some cases avoid the ABA problem which is present with compare-exchange. Many algorithms are described in terms of "CAS" operations because they can be used on both styles of CPU.
A "Load Linked" instruction tells the processor to read a memory location and watch in some way to see if it might be written. A "Conditional Store" instruction instructs the processor to write a memory location only if nothing can have written it since the last "Load Linked" operation. Note that the determination may be pessimistic; processing an interrupt, for example, may invalidate a "Load Linked"/"Conditional Store" sequence. Likewise, in a multi-processor system, an LL/CS sequence may be invalidated by another CPU accessing a location on the same cache line as the location being watched, even if the actual location being watched wasn't touched. In typical usage, LL/CS are used very close together, with a retry loop, so that erroneous invalidations may slow things down a little but won't cause much trouble.

Related

Segmentation fault when attempting to print int value from x86 external function [duplicate]

I've noticed that a lot of calling conventions insist that [e]bx be preserved for the callee.
Now, I can understand why they'd preserve something like [e]sp or [e]bp, since not doing so can mess up the caller's stack. I can also understand why you might want to preserve [e]si or [e]di, since that can break the caller's string instructions if it isn't particularly careful.
But [e]bx? What on earth is so important about [e]bx? What makes [e]bx so special that multiple calling conventions insist that it be preserved throughout function calls?
Is there some sort of subtle bug/gotcha that can arise from messing with [e]bx?
Does modifying [e]bx somehow have a greater impact on the caller than modifying [e]dx or [e]cx, for instance?
I just don't understand why so many calling conventions single out [e]bx for preservation.
Not all registers make good candidates for preserving:
no   (e)ax -- implicitly used in some instructions; return value
no   (e)dx -- edx:eax is implicitly used in cdq, div, mul and in return values
     (e)bx -- generic register, usable in 16-bit addressing modes (base)
     (e)cx -- shift counts, used in loop, rep
     (e)si -- movs operations, usable in 16-bit addressing modes (index)
     (e)di -- movs operations, usable in 16-bit addressing modes (index)
Must (e)bp -- frame pointer, usable in 16-bit addressing modes (base)
Must (e)sp -- stack pointer, not addressable in 8086 (other than push/pop)
Looking at the table, two registers have a good reason to be preserved and two have a reason not to be preserved. The accumulator (e)ax, for example, is the most often used register due to its short encodings. SI and DI make a logical register pair; REP MOVS and the other string operations trash both of them.
In a half-and-half callee/caller-saved scheme, the discussion would basically come down to whether bx/cx is preferred over si/di. In the other calling conventions, it's just EDX, EAX and ECX that can be trashed.
EBX does have a few obscure implicit uses that are still relevant in modern code (e.g. CMPXCHG8B / CMPXCHG16B), but it's the least special register in 32/64-bit code.
EBX makes a good choice for a call-preserved register because it's rare that a function will need to save/restore EBX because it needs EBX specifically, rather than just any non-volatile register. As Brett Hale's answer points out, that makes EBX a great choice for the global offset table (GOT) pointer in ABIs that need one.
In 16-bit mode, addressing modes were limited to (any subset of) [BP|BX + DI|SI + disp8/disp16], so BX is definitely special there.
This is a compromise between not saving any of the registers and saving them all. Either saving none, or saving all, could have been proposed, but either extreme leads to inefficiencies caused by copying the contents to memory (the stack). Choosing to allow some registers to be preserved and some not, reduces the average cost of a function call.
One of the main reasons, certainly for the i386 ELF ABI, is that ebx holds the address of the global offset table (GOT) for position-independent code (PIC). See page 3-35 of the specification for the details. It would be disruptive in the extreme if, say, shared library code had to restore the GOT after every function call return.

How come INC instruction of x86 is not atomic? [duplicate]

This question already has answers here: Can num++ be atomic for 'int num'? (13 answers)
I've read that the INC instruction of x86 is not atomic. My question is how come? Suppose we are incrementing a 64-bit integer on x86-64; we can do it with one instruction, since the INC instruction works with both memory operands and registers. So how come it's not atomic?
Why would it be? The processor core still needs to read the value stored at the memory location, calculate the increment of it, and then store it back. There's a latency between reading and storing, and in the meantime another operation could have affected that memory location.
Even with out-of-order execution, processor cores are 'smart' enough not to trip over their own instructions and aren't responsible for modifying this memory in the time gap. However, another core could have issued an instruction that modifies that location, a DMA transfer could have affected that location, or other hardware could have touched that memory location somehow.
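A small sketch that makes the lost-update problem visible (assumes POSIX threads; the exact output varies from run to run, and the data race is formally undefined behaviour):

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000

static int counter = 0;            /* plain int: increments can be lost */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        counter++;                 /* read-modify-write, not atomic */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Built without optimization, this typically prints well below 2000000. */
    printf("counter = %d (expected %d)\n", counter, 2 * ITERATIONS);
    return 0;
}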
Modern x86 processors, as part of their execution pipeline, "compile" x86 instructions into a lower-level set of operations; Intel calls these uOps, AMD rOps, but what it boils down to is that certain types of x86 instructions get executed by the actual functional units in the CPU as several steps.
That means, for example, that:
INC EAX
gets executed as a single "mini-op" like uOp.inc eax (let me call it that - they're not exposed).
For other operands things will look different. For example:
INC DWORD PTR [ EAX ]
the low-level decomposition though would look more like:
uOp.load tmp_reg, [ EAX ]
uOp.inc tmp_reg
uOp.store [ EAX ], tmp_reg
and therefore is not executed atomically. If on the other hand you prefix by saying LOCK INC [ EAX ], that'll tell the "compile" stage of the pipeline to decompose in a different way in order to ensure the atomicity requirement is met.
The reason for this is, of course, as mentioned by others: speed. Why make something atomic and necessarily slower if it's not always required?
You really don't want a guaranteed atomic operation unless you need it. From Agner Fog's Software optimization resources, instruction_tables.pdf (1996 – 2017):
Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.
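When you do need the atomic (and therefore LOCK-prefixed) form in C, it is usually cleaner to ask for it explicitly than to write the prefix yourself. A minimal sketch using C11 atomics, which GCC and Clang typically compile to a lock add or lock xadd on x86:

#include <stdatomic.h>

static atomic_int hits;

/* Atomic read-modify-write; emitted as a LOCK-prefixed instruction on x86. */
static int count_hit(void)
{
    return atomic_fetch_add(&hits, 1) + 1;   /* returns the incremented value */
}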

understanding the keywords eax and mov

I am trying to understand the registers in asm, but every website I look at just assumes I know something about registers and I just cannot get a grip on it. I know about a book's worth of C++, and as far as I know mov var1, var2 would be the same thing as var1 = var2, correct?
But with the eax register I am completely lost. Any help is appreciated.
Consider registers as per-processor global variables. There's "eax", "ebx", and a bunch of others. Furthermore, you can only perform certain operations via registers - for example there's no instruction to read from one memory location and write it to another (except when the locations are denoted by certain registers - see movsb instruction, etc).
So the registers are generally used only for temporary storage of values that are needed for some operation, but they usually are not used as global variables in the conventional sense.
You are right that "mov var1, var2" is essentially an assignment - but you cannot use two memory-based variables as operands; that's not supported. You could instead do:
mov eax, var1
mov var2, eax
... which has the same effect, using the eax register as a temporary.
eax refers to a processor register (essentially a variable)
mov is an instruction that copies data from one location (register or memory) to another. So essentially you are correct (in a handwavey sense).
Do you have an example assembly block you want to discuss?
Think of eax as a storage location inside the processor where a value can be stored, much like in C++ where int, long, ... and other types specify the size of a variable. The eax register is 32 bits wide on x86; the e part of eax means extended. This register is implicitly used by the multiplication and division instructions and is traditionally called the extended accumulator register.

How do memory fences work?

I need to understand memory fences in multicore machines. Say I have this code
Core 1
mov [_x], 1; mov r1, [_y]
Core 2
mov [_y], 1; mov r2, [_x]
Now the unexpected result without memory fences is that both r1 and r2 can be 0 after execution. In my opinion, to counter that problem, we should put a memory fence in both pieces of code, as putting one in only one of them would still not solve the problem. Something like the following...
Core 1
mov [_x], 1; memory_fence; mov r1, [_y]
Core 2
mov [_y], 1; memory_fence; mov r2, [_x]
Is my understanding correct, or am I still missing something? Assume the architecture is x86. Also, can someone tell me how to put memory fences in C++ code?
Fences serialize the operations that they fence (loads & stores): no other operation may start till the fence is executed, but the fence will not execute till all preceding operations have completed. Quoting Intel makes the meaning of this a little more precise (taken from the MFENCE instruction, page 3-628, Vol. 2A, Intel Instruction Set Reference):
This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.
A load instruction is considered to become globally visible when the value to be loaded into its destination register is determined.
Using fences in C++ is tricky (C++11 provides fence semantics via std::atomic_thread_fence; see the next answer), as it is otherwise platform and compiler dependent. For x86 using MSVC or ICC, you can use the _mm_lfence, _mm_sfence & _mm_mfence intrinsics for load, store and load + store fencing (note that some of the underlying instructions require SSE2).
Note: this assumes an Intel perspective, that is: one using an x86 (32 or 64 bit) or IA64 processor
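For completeness, a hedged C11 version of the two-core example above, using <stdatomic.h> (the same fences exist in C++11 as std::atomic_thread_fence); the seq_cst fence is what typically compiles to MFENCE, or an equivalent LOCK-prefixed instruction, on x86:

#include <stdatomic.h>

static atomic_int x, y;
static int r1, r2;

void core1(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* full fence between the store and the load */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
}

void core2(void)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
}

/* With both fences in place, the outcome r1 == 0 && r2 == 0 is no longer allowed. */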
C++11 (ISO/IEC 14882:2011) defines a multi-threading-aware memory model.
Although I don't know of any compiler that currently implements the new memory model, C++ Concurrency in Action by Anthony Williams documents it very well. You may check Chapter 5 - The C++ Memory Model and Operations on Atomic Types where he explains about relaxed operations and memory fences. Also, he is the author of the just::thread library that may be used till we have compiler vendor support of the new standard.
just::thread is the base for the boost::thread library.

Difference between n = 0 and n = n - n

When I read this question I remembered someone once telling me (many years ago) that from an assembler-point-of-view, these two operations are very different:
n = 0;
n = n - n;
Is this true, and if it is, why is it so?
EDIT: As pointed out by some replies, I guess this would be fairly easy for a compiler to optimize into the same thing. But what I find interesting is why they would differ if the compiler had a completely general approach.
When writing assembler code, you often used:
xor eax, eax
instead of
mov eax, 0
That is because with the first statement you have only the opcode and no immediate operand. Your CPU will do that in 1 cycle (instead of 2). I think your case is something similar (although using sub).
Compiler VC++ 6.0, without optimisations:
4: n = 0;
0040102F mov dword ptr [ebp-4],0
5:
6: n = n - n;
00401036 mov eax,dword ptr [ebp-4]
00401039 sub eax,dword ptr [ebp-4]
0040103C mov dword ptr [ebp-4],eax
In the early days, memory and CPU cycles were scarce. That led to a lot of so-called "peephole optimizations". Let's look at the code:
move.l #0,d0
moveq.l #0,d0
sub.l a0,a0
The first instruction would need two bytes for the op-code and then four bytes for the value (0). That meant four bytes wasted plus you'd need to access the memory twice (once for the opcode and once for the data). Sloooow.
moveq.l was better since it merged the data into the op-code, but it only allowed writing values between -128 and 127 into a register. And you were limited to data registers only; there was no quick way to clear an address register. You'd have to clear a data register and then load the data register into an address register (two op-codes. Bad.).
Which leads to the last operation, which works on any register, needs only two bytes, and does a single memory read. Translated into C, you'd get
n = n - n;
which would work for most often used types of n (integer or pointer).
An optimizing compiler will produce the same assembly code for the two.
It may depend on whether n is declared as volatile or not.
The assembly-language technique of zeroing a register by subtracting it from itself or XORing it with itself is an interesting one, but it doesn't really translate to C.
Any optimising C compiler will use this technique if it makes sense, and trying to write it out explicitly is unlikely to achieve anything.
In C they only differ (for integer types) if your compiler sucks (or you disabled optimization like an MSVC answer shows).
Perhaps the person who told you this was trying to describe an asm instruction like sub reg,reg using C syntax, not talking about how such a statement would actually compile with a modern optimizing compiler? In which case I wouldn't say "very different" for most x86 CPUs; most do special-case sub same,same as a zeroing idiom, like xor same,same. What is the best way to set a register to zero in x86 assembly: xor, mov or and?
That makes an asm sub reg,reg similar to mov reg,0, with somewhat better code size. (But yes, some unique benefits wrt. partial-register renaming on Intel P6-family that you can only get from zeroing idioms, not mov).
They could differ in C if your compiler is trying to implement the mostly-deprecated memory_order_consume semantics from <stdatomic.h> on a weakly-ordered ISA like ARM or PowerPC, where n=0 breaks the dependency on the old value but n = n-n; still "carries a dependency", so a load like array[n] will be dependency-ordered after n = atomic_load_explicit(&shared_var, memory_order_consume). See Memory order consume usage in C11 for more details
In practice compilers gave up on trying to get that dependency-tracking right and promote consume loads to acquire. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0371r1.html and When should you not use [[carries_dependency]]?
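A hedged sketch of the consume case described above, reusing the answer's own array / shared_var names (purely illustrative; as noted, real compilers currently treat the consume load as acquire anyway):

#include <stdatomic.h>

extern int array[];
extern _Atomic int shared_var;

int consume_example(void)
{
    int n = atomic_load_explicit(&shared_var, memory_order_consume);
    n = n - n;        /* value is 0, but this still "carries a dependency" on the load */
    /* With n = 0; instead, the dependency chain would be broken and the
     * access below would no longer be dependency-ordered after the load. */
    return array[n];
}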
But in asm for weakly-ordered ISAs, sub dst, same, same is required to still carry a dependency on the input register, just like in C. (Most weakly-ordered ISAs are RISCs with fixed-width instructions, so avoiding an immediate operand doesn't make the machine code any smaller. Thus there is no historical use of shorter zeroing idioms like sub r1, r1, r1, even on ISAs like ARM that don't have an architectural zero register. mov r1, #0 is the same size and at least as efficient as any other way. On MIPS you'd just use move $v0, $zero.)
So yes, for those non-x86 ISAs, they are very different in asm. n=0 avoids any false dependency on the old value of the variable (register), while n=n-n can't execute until the old value of n is ready.
Only x86 special-cases sub same,same and xor same,same as a dependency-breaking zeroing idiom like mov eax, imm32, because mov eax, 0 is 5 bytes but xor eax,eax is only 2. So there was a long history of using this peephole optimization before out-of-order execution CPUs, and such CPUs needed to run existing code efficiently. What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains the details.
Unless you're writing by hand in x86 asm, write 0 like a normal person instead of n-n or n^n, and let the compiler use xor-zeroing as a peephole optimization.
Asm for other ISAs might have other peepholes, e.g. another answer mentions m68k. But again, if you're writing in C this is the compiler's job. Write 0 when you mean 0. Trying to "hand hold" the compiler into using an asm peephole is very unlikely to work with optimization disabled, and with optimization enabled the compiler will efficiently zero a register if it needs to.
Not sure about assembly and such, but generally,
n = 0
n = n - n
aren't always equal if n is floating point; see here:
http://www.codinghorror.com/blog/archives/001266.html
Here are some corner cases where the behavior is different for n = 0 and n = n - n:
if n has a floating point type, the result will differ from 0 for specific values: -0.0, Infinity, -Infinity, NaN...
if n is defined as volatile: the first expression will generate a single store into the corresponding memory location, while the second expression will generate two loads and a store; furthermore, if n is the location of a hardware register, the 2 loads might yield different values, causing the write to store a non-zero value (see the sketch after this list).
if optimisations are disabled, the compiler might generate different code for these 2 expressions even for plain int n, which might or might not execute at the same speed.
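A minimal sketch of the volatile case, assuming n is something like a memory-mapped hardware register:

volatile int n;                             /* e.g. a memory-mapped device register      */

void zero_direct(void)   { n = 0; }         /* one volatile store                        */
void zero_indirect(void) { n = n - n; }     /* two volatile loads, a subtract, one store;
                                               the loads may observe different values    */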
