I've read that the INC instruction on x86 is not atomic. My question is: how come? Suppose we are incrementing a 64-bit integer on x86-64. We can do it with one instruction, since INC works with both memory operands and registers. So how come it's not atomic?
Why would it be? The processor core still needs to read the value stored at the memory location, calculate the increment of it, and then store it back. There's a latency between reading and storing, and in the meantime another operation could have affected that memory location.
Even with out-of-order execution, processor cores are 'smart' enough not to trip over their own instructions and wouldn't be responsible for modifying this memory in the time gap. However, another core could have issued an instruction that modifies that location, a DMA transfer could have affected that location, or other hardware touched that memory location somehow.
As part of their execution pipeline, modern x86 processors "compile" x86 instructions into a lower-level set of operations. Intel calls these uOps and AMD calls them rOps, but what it boils down to is that certain types of single x86 instructions get executed by the actual functional units in the CPU as several steps.
That means, for example, that:
INC EAX
gets executed as a single "mini-op" like uOp.inc eax (let me call it that - they're not exposed).
For other operands things look different. For example, for:
INC DWORD PTR [ EAX ]
the low-level decomposition would look more like:
uOp.load tmp_reg, [ EAX ]
uOp.inc tmp_reg
uOp.store [ EAX ], tmp_reg
and therefore is not executed atomically. If, on the other hand, you add a LOCK prefix (LOCK INC [ EAX ]), that tells the "compile" stage of the pipeline to decompose the instruction in a different way in order to ensure the atomicity requirement is met.
The reason for this is, of course, as mentioned by others: speed. Why make something atomic, and necessarily slower, if that is not always required?
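The load / increment / store decomposition described above can be modelled deterministically in C. This is only a sketch: the `core_t` type and the `uop_*` helpers are hypothetical stand-ins for the internal micro-ops (which are not exposed), and the interleaving is scripted by hand rather than produced by real threads. The second function uses C11 `atomic_fetch_add`, which compilers for x86 typically implement with a LOCK-prefixed instruction.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of one core's temporary register. */
typedef struct { int tmp; } core_t;

/* Hypothetical stand-ins for the three micro-ops behind INC [mem]. */
static void uop_load(core_t *c, const int *mem)  { c->tmp = *mem; }
static void uop_inc(core_t *c)                   { c->tmp += 1; }
static void uop_store(const core_t *c, int *mem) { *mem = c->tmp; }

/* Two cores each "increment" the same location, but both load
 * before either stores, so one of the increments is lost. */
int interleaved_increments(void) {
    int mem = 0;
    core_t a, b;
    uop_load(&a, &mem);
    uop_load(&b, &mem);   /* b reads the stale value 0 */
    uop_inc(&a);
    uop_inc(&b);
    uop_store(&a, &mem);  /* mem = 1 */
    uop_store(&b, &mem);  /* mem = 1 again: a's update is overwritten */
    return mem;
}

/* The same two increments as indivisible read-modify-write operations. */
int atomic_increments(void) {
    atomic_int mem = 0;
    atomic_fetch_add(&mem, 1);
    atomic_fetch_add(&mem, 1);
    return atomic_load(&mem);
}
```

With the scripted interleaving the plain version ends at 1 instead of 2, while the atomic version cannot lose an update no matter how the two operations are ordered.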
You really don't want a guaranteed atomic operation unless you need it. From Agner Fog's Software optimization resources, instruction_tables.pdf (1996 – 2017):
Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.
Related
I'd like to ask if using a CMPXCHG instruction on an 8-bit memory field would be worse in any aspect than using it on a 32-bit field.
I'm using C11 stdatomic.h to implement a couple of synchronization methods.
No, there's no penalty for lock cmpxchg [mem], reg with an 8-bit vs. a 32-bit operand. Modern x86 CPUs can load and store to their L1d cache with no penalty for a single byte vs. an aligned dword or qword. See "Can modern x86 hardware not store a single byte to memory?": it can, with zero penalty (footnote 1), because these CPUs spend the transistors to make even unaligned loads/stores fast.
The surrounding asm instructions dealing with a narrow integer in a register should also have negligible if any extra cost vs. [u]int32_t. See Why doesn't GCC use partial registers? - most compilers know how to be careful with partial registers, and modern CPUs (Haswell and later, and all non-Intel) don't rename the low 8 separately from the rest of the register so the only danger is false dependencies. Depending on exactly what you're doing, it might be best to use unsigned local temporaries with an _Atomic uint8_t, or it might be best to make your locals also uint8_t.
Footnote 1: Unlike on some non-x86 CPUs where a byte store actually is implemented with a cache RMW cycle (Are there any modern CPUs where a cached byte store is actually slower than a word store?). On those CPUs you'd hope that atomic xchg would be just as cheap for word vs. byte, but that's too much to hope for with cmpxchg. But almost all non-x86 ISAs have LL/SC instead of xchg / cmpxchg anyway, so even an atomic exchange is separate LL and SC instructions, and the SC would take an RMW cycle to commit to cache.
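For the C11 stdatomic.h case mentioned in the question, the 8-bit and 32-bit versions look identical at the source level. The sketch below uses hypothetical names (`flag8`, `flag32`, `try_claim8`, `try_claim32`) for a simple "claim a flag" operation; on x86 a compiler would be expected to emit lock cmpxchg with a byte or dword operand respectively, but the exact code generation is an assumption that should be checked in the compiler output.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags; static storage means they start at 0. */
static _Atomic uint8_t  flag8;
static _Atomic uint32_t flag32;

/* Try to change the 8-bit flag from 0 to 1; true if this call did it. */
bool try_claim8(void) {
    uint8_t expected = 0;
    return atomic_compare_exchange_strong(&flag8, &expected, 1);
}

/* Same operation on a 32-bit flag. */
bool try_claim32(void) {
    uint32_t expected = 0;
    return atomic_compare_exchange_strong(&flag32, &expected, 1);
}

/* Reset both flags so they can be claimed again. */
void release_flags(void) {
    atomic_store(&flag8, 0);
    atomic_store(&flag32, 0);
}
```

The first call to each `try_claim` function succeeds and the second fails until `release_flags` is called, regardless of the operand width.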
When we compile and execute code, the assembly that our code is converted into stores functions in a non-sequential manner. So every time a function is called, the processor needs to throw away the instructions in the pipeline. Doesn't this affect the performance of the program?
PS: I'm not considering the time invested in developing such programs without functions. Purely on the performance level. Are there any ways in which compilers deal with this to reduce it?
So every time a function is called, the processor needs to throw away the instructions in the pipeline.
No, everything after the decode stage is still good. The CPU knows not to keep decoding after an unconditional branch (like a jmp, call, or ret). Only the instructions that have been fetched but not yet decoded are ones that shouldn't run. Until the target address is decoded from the instruction, there's nothing useful for the beginning of the pipeline to do, so you get bubbles in the pipeline until the target address is known. Decoding branch instructions as early as possible thus minimizes the penalty for taken branches.
In the classic RISC pipeline, the stages are IF ID EX MEM WB (fetch, decode, execute, mem, write-back of results to registers). So when ID decodes a branch instruction, the pipeline throws away the instruction currently being fetched in IF, because it's the instruction after the branch.
"Hazard" is the term for things that prevent a steady stream of instructions from going through the pipeline at one per clock. Branches are a Control Hazard. (Control as in flow-control, as opposed to data.)
If the branch target isn't in L1 I-cache, the pipeline will have to wait for instructions to stream in from memory before the IF pipeline stage can produce a fetched instruction. I-cache misses always create a pipeline bubble. Prefetching usually avoids this for non-branching code.
More complex CPUs decode far enough ahead to detect branches and re-steer fetch soon enough to hide this bubble. This may involve a queue of decoded instructions to hide the fetch bubble.
Also, instead of actually decoding to detect branch instructions, the CPU can check every instruction address against a "Branch Target Buffer" cache. If you get a hit, you know the instruction is a branch even though you haven't decoded it yet. The BTB also holds the target address, so you can start fetching from there right away (if it's an unconditional branch or your CPU supports speculative execution based on branch prediction).
ret is actually the harder case: the return address is in a register or on the stack, not encoded directly into the instruction. It's an unconditional indirect branch. Modern x86 CPUs maintain an internal return-address predictor stack, and perform very badly when you mismatch call/ret instructions. E.g. call label / label: pop ebx is a terrible way for position-independent 32-bit code to get EIP into EBX. That will cause a mispredict for the next 15 or so rets up the call tree.
I think I've read that a return-address predictor stack is used by some other non-x86 microarchitectures.
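The effect of an unmatched call on a return-address predictor can be illustrated with a toy model. Everything here is an assumption for illustration: the `ras_t` structure, the depth of 16 entries, and the policy of simply popping on every ret are hypothetical and much simpler than what real hardware does.

```c
#include <assert.h>
#include <stddef.h>

#define RAS_DEPTH 16  /* assumed predictor depth; real sizes vary by CPU */

/* Toy return-address predictor stack (hypothetical model). */
typedef struct {
    unsigned long slot[RAS_DEPTH];
    size_t top;  /* number of outstanding pushes; indexes wrap over slots */
} ras_t;

/* A call pushes its return address onto the predictor. */
static void ras_call(ras_t *r, unsigned long ret_addr) {
    r->slot[r->top % RAS_DEPTH] = ret_addr;
    r->top++;
}

/* A ret pops a prediction; returns 1 if it matches the actual target. */
static int ras_ret(ras_t *r, unsigned long actual) {
    if (r->top == 0) return 0;  /* nothing to predict from */
    r->top--;
    return r->slot[r->top % RAS_DEPTH] == actual;
}

/* Make `depth` nested calls, optionally one call that is never matched
 * by a ret (like call label / label: pop ebx), then return all the way
 * up and count the mispredicted rets. */
int mispredicts_after_unmatched_call(int depth, int unmatched) {
    ras_t r = { {0}, 0 };
    int misses = 0;
    for (int i = 1; i <= depth; i++) ras_call(&r, (unsigned long)i);
    if (unmatched) ras_call(&r, 0xdeadUL);  /* pushed but never returned to */
    for (int i = depth; i >= 1; i--) misses += !ras_ret(&r, (unsigned long)i);
    return misses;
}
```

In this model, a single unmatched call shifts every later prediction by one entry, so every ret on the way back up the call tree mispredicts, while balanced call/ret pairs predict perfectly.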
See Agner Fog's microarchitecture pdf to learn more about how x86 CPUs behave (also see the x86 tag wiki), or read a computer architecture textbook to learn about simple RISC pipelines.
For more about caches and memory (mostly focused on data caching / prefetching), see Ulrich Drepper's What Every Programmer Should Know About Memory.
An unconditional branch is quite cheap, like usually a couple cycles at worst (not including I-cache misses).
The big cost of a function call is when the compiler can't see the definition of the target function, and has to assume it clobbers all the call-clobbered registers in the calling convention. (In x86-64 SystemV, all the float/vector registers, and about 8 integer registers.) This requires either spilling to memory or keeping live data in call-preserved registers. But using call-preserved registers means the function has to save/restore them so that it doesn't break its own caller.
Inter-procedural optimization to let functions take advantage of knowing which registers other functions actually clobber, and which they don't, is something compilers can do within the same compilation unit. Or even across compilation units with link-time whole-program optimization. But it can't extend across dynamic-linking boundaries, because the compiler isn't allowed to make code that will break with a differently-compiled version of the same shared library.
Are there any ways in which compilers deal with this to reduce it?
They inline small functions, or even large static functions that are only called once.
e.g.
int foo(void) { return 1; }
mov eax, 1 #,
ret
int bar(int x) { return foo() + x;}
lea eax, [rdi+1] # D.2839,
ret
As @harold points out, overdoing it with inlining can cause cache misses, too, because it inflates your code size so much that not all of your hot code fits in cache.
Intel SnB-family designs have a small but very fast uop cache that caches decoded instructions. It only holds at most 1536 uops IIRC, in lines of 6 uops each. Executing from uop cache instead of from the decoders shortens the branch-mispredict penalty from 19 to 15 cycles, IIRC (something like that, but those numbers are probably not actually correct for any specific uarch). There's also a significant frontend throughput boost compared to the decoders, esp. for long instructions which are common in vector code.
Question:
Is accessing the stack the same speed as accessing memory?
For example, I could choose to do some work within the stack, or I could do work directly with a labelled location in memory.
So, specifically: is push ax the same speed as mov [bx], ax? Likewise is pop ax the same speed as mov ax, [bx]? (assume bx holds a location in near memory.)
Motivation for Question:
It is common in C to discourage trivial functions that take parameters.
I've always thought that is because not only must the parameters get pushed onto the stack and then popped off the stack once the function returns, but also because the function call itself must preserve the CPU's context, which means more stack usage.
But assuming one knows the answer to the headlined question, it should be possible to quantify the overhead that the function uses to set itself up (push / pop / preserve context, etc.) in terms of an equivalent number of direct memory accesses. Hence the headlined question.
(Edit: Clarification: near used above is as opposed to far in the segmented memory model of 16-bit x86 architecture.)
Nowadays your C compiler can outsmart you. It may inline simple functions, and if it does, there is no function call or return. There may also be no stack manipulation for passing and accessing the function's formal parameters if everything can be done in registers or, better yet, if the result is a constant value and the compiler can see that and take advantage of it. (An inlined function can still touch the stack when the available registers are exhausted and values have to be spilled.)
Function calls themselves can be relatively cheap (but not necessarily zero-cost) on modern CPUs, if they're repeated and if there's a separate instruction cache and various predicting mechanisms, helping with efficient code execution.
Other than that, I'd expect the performance implications of the choice "local var vs global var" to depend on the memory usage patterns. If there's a memory cache in the CPU, the stack is likely to be in that cache, unless you allocate and deallocate large arrays or structures on it or have deep function calls or deep recursion, causing cache misses. If the global variable of interest is accessed often or if its neighbors are accessed often, I'd expect that variable to be in the cache most of the time as well. Again, if you're accessing large spans of memory that can't fit into the cache, you'll have cache misses and possibly reduced performance (possibly because there may or may not be a better, cache-friendly way of doing what you want to do).
If the hardware is pretty dumb (no or small caches, no prediction, no instruction reordering, no speculative execution, nothing), clearly you want to reduce the memory pressure and the number of function calls because each and every one will count.
Yet another factor is instruction length and decoding. Instructions to access an on-stack location (relative to the stack pointer) can be shorter than instructions to access an arbitrary memory location at a given address. Shorter instructions may be decoded and executed faster.
I'd say there's no definitive answer for all cases because performance depends on:
your hardware
your compiler
your program and its memory accessing patterns
For the clock-cycle-curious...
For those who would like to see specific clock cycles, instruction / latency tables for a variety of modern x86 and x86-64 CPUs are available here (thanks to hirschhornsalz for pointing these out).
You then get, on a Pentium 4 chip:
push ax and mov [bx], ax (red boxed) are virtually identical in their efficiency with identical latencies and throughputs.
pop ax and mov ax, [bx] (blue boxed) are similarly efficient, with identical throughputs, despite mov ax, [bx] having twice the latency of pop ax.
As far as the follow-on question in the comments (3rd comment):
indirect addressing (i.e. mov [bx], ax) is not materially different than direct addressing (i.e. mov [loc], ax), where loc is a variable holding an immediate value, e.g. loc equ 0xfffd.
Conclusion: Combine this with Alexey's thorough answer, and there's a pretty solid case for the efficiency of using the stack and letting the compiler decide when a function should be inlined.
(Side note: In fact, even as far back as the 8086 from 1978, using the stack was still not less efficient than corresponding mov's to memory as can be seen from these old 8086 instruction timing tables.)
Understanding Latency & Throughput
A bit more may be needed to understand timing tables for modern CPUs. These should help:
definitions of latency and throughput
a useful analogy for latency and throughput, and their relation to instruction processing pipelines
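As a rough worked example of how the two numbers in a timing table combine, the sketch below uses a deliberately simplified model: a chain of dependent instructions is limited by latency, while independent instructions are limited by throughput. The function names and the model itself are assumptions for illustration; real CPUs have many more limits (issue width, ports, memory).

```c
#include <assert.h>

/* Each instruction needs the previous result, so each one waits
 * the full latency: total is about n * latency cycles. */
long dependent_chain_cycles(long n, long latency) {
    return n * latency;
}

/* Independent instructions overlap in the pipeline: a new one starts
 * every `recip_throughput` cycles, and the last one finishes `latency`
 * cycles after it starts. */
long independent_cycles(long n, long latency, long recip_throughput) {
    if (n <= 0) return 0;
    return (n - 1) * recip_throughput + latency;
}
```

For example, with a hypothetical latency of 4 cycles and a reciprocal throughput of 1 cycle, ten dependent instructions take about 40 cycles in this model, while ten independent ones take about 13.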
I need to understand memory fences in multicore machines. Say I have this code
Core 1
mov [_x], 1; mov r1, [_y]
Core 2
mov [_y], 1; mov r2, [_x]
Now the unexpected result without memory fences would be that both r1 and r2 can be 0 after execution. In my opinion, to counter that problem, we should put a memory fence in both code sequences, as putting it in only one would still not solve the problem. Something like the following:
Core 1
mov [_x], 1; memory_fence; mov r1, [_y]
Core 2
mov [_y], 1; memory_fence; mov r2, [_x]
Is my understanding correct or am I still missing something? Assume the architecture is x86. Also, can someone tell me how to put memory fences in a C++ code?
Fences serialize the operations that they fence (loads & stores): no later operation may start until the fence has executed, and the fence will not execute until all preceding operations have completed. Quoting Intel makes the meaning of this a little more precise (taken from the MFENCE instruction, page 3-628, Vol. 2A, Intel Instruction reference):
This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. [1]
[1] A load instruction is considered to become globally visible when the value to be loaded into its destination register is determined.
Using fences in C++ is tricky, as it is platform and compiler dependent (C++11 does provide a portable fence, std::atomic_thread_fence in <atomic>). For x86 using MSVC or ICC, you can use the _mm_lfence, _mm_sfence & _mm_mfence intrinsics for load, store, and load + store fencing respectively (note that some of these are SSE2 instructions).
Note: this assumes an Intel perspective, that is: one using an x86 (32 or 64 bit) or IA64 processor
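A portable way to express the store / fence / load pattern from the question is the C11 fence function `atomic_thread_fence` (C++11 has the same facility as `std::atomic_thread_fence`). The sketch below is a minimal illustration, not a full test harness: the names `shared_x`, `shared_y`, `core1` and `core2` are hypothetical, and each function is meant to be run on its own thread. On x86, compilers typically implement the sequentially consistent fence with mfence or a locked instruction.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared variables, both initially 0. */
static atomic_int shared_x, shared_y;

/* Intended for thread 1: store x, full fence, then load y. */
int core1(void) {
    atomic_store_explicit(&shared_x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* keeps the load below from passing the store above */
    return atomic_load_explicit(&shared_y, memory_order_relaxed);
}

/* Intended for thread 2: store y, full fence, then load x. */
int core2(void) {
    atomic_store_explicit(&shared_y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&shared_x, memory_order_relaxed);
}
```

With a fence in both functions, the outcome where both return 0 is ruled out by the C11 memory model. Dropping the fence from either function allows it again, which matches the reasoning in the question.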
C++11 (ISO/IEC 14882:2011) defines a multi-threading-aware memory model.
Although I don't know of any compiler that currently implements the new memory model, C++ Concurrency in Action by Anthony Williams documents it very well. You may check Chapter 5 - The C++ Memory Model and Operations on Atomic Types where he explains about relaxed operations and memory fences. Also, he is the author of the just::thread library that may be used till we have compiler vendor support of the new standard.
just::thread is a separate commercial implementation of the C++11 thread library; Anthony Williams has also maintained the boost::thread library.
How would you write a function in C which does an atomic compare and swap on an integer value, using embedded machine code (assuming, say, x86 architecture)? Can it be any more specific if it's written only for the i7 processor?
Does the translation act as a memory fence, or does it only enforce an ordering relation on the memory location included in the compare and swap? How costly is it compared to a memory fence?
Thank you.
The easiest way to do it is probably with a compiler intrinsic like _InterlockedCompareExchange(). It looks like a function but is actually a special case in the compiler that boils down to a single machine op. In the case of the MSVC x86 intrinsic, that works as a read/write fence as well, but that's not necessarily true on other platforms. (For example, on the PowerPC, you'd need to explicitly issue a lwsync to fence memory reordering.)
In general, on many common systems, a compare-and-swap operation usually only enforces an atomic transaction upon the one address it's touching. Other memory access can be reordered, and in multicore systems, memory addresses other than the one you've swapped may not be coherent between the cores.
You can use the CMPXCHG instruction with the LOCK prefix for atomic execution.
E.g.
lock cmpxchg DWORD PTR [ebx], edx
or
lock cmpxchgl %edx, (%ebx)
This compares the value in the EAX register with the value at the address stored in the EBX register and stores the value in the EDX register to that location if they are the same, otherwise it loads the value at the address stored in the EBX register into EAX.
You need to have a 486 or later for this instruction to be available.
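Wrapped in a C function, the instruction above might look like the following sketch. It assumes GCC or Clang extended inline assembly on x86; the function name `cas_u32` is hypothetical, and on other targets the sketch falls back to the compiler's `__atomic_compare_exchange_n` builtin instead of hand-written assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Atomically: if (*ptr == expected) *ptr = desired.
 * Returns the value *ptr held before the operation, so the swap
 * happened exactly when the return value equals `expected`. */
uint32_t cas_u32(volatile uint32_t *ptr, uint32_t expected, uint32_t desired) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    uint32_t prev;
    __asm__ __volatile__(
        "lock cmpxchgl %2, %1"          /* compares EAX with *ptr */
        : "=a"(prev), "+m"(*ptr)        /* EAX out: old value of *ptr */
        : "r"(desired), "0"(expected)   /* EAX in: expected value */
        : "memory", "cc");
    return prev;
#else
    /* Fallback for other targets: on failure the builtin writes the
     * current value of *ptr into `expected`. */
    __atomic_compare_exchange_n(ptr, &expected, desired, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
#endif
}
```

The "memory" clobber stops the compiler from reordering other memory accesses across the operation. On x86 the LOCK prefix additionally makes the instruction act as a full barrier at the hardware level.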
If your integer value is 64 bits, then use cmpxchg8b (8-byte compare and exchange) under IA-32 x86.
The variable must be 8-byte aligned.
Example:
mov eax, OldDataA //load low 32 bits of the expected (old) value
mov edx, OldDataB //load high 32 bits of the expected (old) value
mov ebx, NewDataA //load low 32 bits of the new value
mov ecx, NewDataB //load high 32 bits of the new value
mov edi, Destination //load destination pointer
lock cmpxchg8b qword ptr [edi] //compare EDX:EAX with [edi]; if equal, store ECX:EBX there
setz al //al is 1 if the exchange succeeded, else 0
If the LOCK prefix is omitted in atomic processor instructions, atomic operation across multiprocessor environment will not be guaranteed.
In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted. Intel Instruction Set Reference
Without the LOCK prefix, the operation is only guaranteed not to be interrupted by an event (interrupt) on the current processor/core.
It's interesting to note that some processors don't provide a compare-exchange, but instead provide some other instructions ("Load Linked" and "Conditional Store") that can be used to synthesize the unfortunately-named compare-and-swap (the name sounds like it should be similar to "compare-exchange" but should really be called "compare-and-store" since it does the comparison, stores if the value matches, and indicates whether the value matched and the store was performed). The instructions cannot synthesize compare-exchange semantics (which provides the value that was read in case the compare failed), but may in some cases avoid the ABA problem which is present with Compare-Exchange. Many algorithms are described in terms of "CAS" operations because they can be used on both styles of CPU.
A "Load Linked" instruction tells the processor to read a memory location and watch in some way to see if it might be written. A "Conditional Store" instruction instructs the processor to write a memory location only if nothing can have written it since the last "Load Linked" operation. Note that the determination may be pessimistic; processing an interrupt, for example, may invalidate a "Load-Linked"/"Conditional Store" sequence. Likewise, in a multi-processor system, an LL/CS sequence may be invalidated by another CPU accessing a location on the same cache line as the location being watched, even if the actual location being watched wasn't touched. In typical usage, LL/CS are used very close together, with a retry loop, so that erroneous invalidations may slow things down a little but won't cause much trouble.
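The retry-loop pattern described above has a portable counterpart in C11: `atomic_compare_exchange_weak` is allowed to fail spuriously, which lets compilers implement it directly with a load-linked / store-conditional pair on CPUs that have one. The sketch below uses it to build an "atomic maximum"; the function name `atomic_fetch_max_int` is hypothetical and not part of stdatomic.h.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically raise *target to at least `value`.
 * Returns the value observed just before the final decision. */
int atomic_fetch_max_int(atomic_int *target, int value) {
    int seen = atomic_load_explicit(target, memory_order_relaxed);
    /* Retry while an update is still needed and the weak CAS fails.
     * On failure (real or spurious) `seen` is refreshed with the
     * current contents of *target, so the condition is re-evaluated
     * against up-to-date data. */
    while (seen < value &&
           !atomic_compare_exchange_weak(target, &seen, value)) {
        /* nothing to do here: the loop condition does the work */
    }
    return seen;
}
```

Because the loop re-checks its condition after every failed attempt, a spurious failure only costs one extra iteration, which is the "slow things down a little" case mentioned above.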