Difference between n = 0 and n = n - n

When I read this question I remembered someone once telling me (many years ago) that, from an assembler point of view, these two operations are very different:
n = 0;
n = n - n;
Is this true, and if it is, why is it so?
EDIT: As pointed out by some replies, I guess this would be fairly easy for a compiler to optimize into the same thing. But what I find interesting is why they would differ if the compiler had a completely general approach.

When writing assembler code, you often used:
xor eax, eax
instead of
mov eax, 0
That is because with the first statement you have only the opcode and no immediate argument. Your CPU will do that in 1 cycle (instead of 2). I think your case is something similar (although using sub).
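A quick way to see this for yourself (a minimal sketch, assuming GCC or Clang on x86-64; zero.c is a hypothetical file name): compile a trivial function with optimization enabled and look at the listing.
int zero(void)
{
    return 0;  /* gcc -O2 -S zero.c should show "xor eax, eax"
                  (AT&T: "xorl %eax, %eax"), not "mov eax, 0" */
}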

Compiler VC++ 6.0, without optimisations:
4: n = 0;
0040102F mov dword ptr [ebp-4],0
5:
6: n = n - n;
00401036 mov eax,dword ptr [ebp-4]
00401039 sub eax,dword ptr [ebp-4]
0040103C mov dword ptr [ebp-4],eax

In the early days, memory and CPU cycles were scarce. That led to a lot of so-called "peephole optimizations". Let's look at the code:
move.l #0,d0
moveq.l #0,d0
sub.l a0,a0
The first instruction would need two bytes for the op-code and then four bytes for the value (0). That meant four bytes wasted plus you'd need to access the memory twice (once for the opcode and once for the data). Sloooow.
moveq.l was better since it merged the data into the op-code, but it only allowed writing values between -128 and 127 into a register. And you were limited to data registers only; there was no quick way to clear an address register. You'd have to clear a data register and then load the data register into an address register (two op-codes. Bad.).
Which led to the last operation, which works on any register, needs only two bytes, and takes a single memory read. Translated into C, you'd get
n = n - n;
which would work for the most commonly used types of n (integer or pointer).

An optimizing compiler will produce the same assembly code for the two.

It may depend on whether n is declared as volatile or not.
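For example (a minimal sketch): with a volatile n, every access is an observable side effect, so the two statements must compile differently.
volatile int n;  /* imagine a memory-mapped hardware register */

void zero_store(void)    { n = 0; }      /* exactly one store */
void zero_subtract(void) { n = n - n; }  /* two loads, then a store; on real
                                            hardware the loads may even
                                            return different values */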

The assembly-language technique of zeroing a register by subtracting it from itself or XORing it with itself is an interesting one, but it doesn't really translate to C.
Any optimising C compiler will use this technique if it makes sense, and trying to write it out explicitly is unlikely to achieve anything.

In C they only differ (for integer types) if your compiler sucks (or you disabled optimization like an MSVC answer shows).
Perhaps the person who told you this was trying to describe an asm instruction like sub reg,reg using C syntax, not talking about how such a statement would actually compile with a modern optimizing compiler? In which case I wouldn't say "very different" for most x86 CPUs; most do special-case sub same,same as a zeroing idiom, like xor same,same. What is the best way to set a register to zero in x86 assembly: xor, mov or and?
That makes an asm sub reg,reg similar to mov reg,0, with somewhat better code size. (But yes, some unique benefits wrt. partial-register renaming on Intel P6-family that you can only get from zeroing idioms, not mov).
They could differ in C if your compiler is trying to implement the mostly-deprecated memory_order_consume semantics from <stdatomic.h> on a weakly-ordered ISA like ARM or PowerPC, where n=0 breaks the dependency on the old value but n = n-n; still "carries a dependency", so a load like array[n] will be dependency-ordered after n = atomic_load_explicit(&shared_var, memory_order_consume). See Memory order consume usage in C11 for more details
In practice compilers gave up on trying to get that dependency-tracking right and promote consume loads to acquire. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0371r1.html and When should you not use [[carries_dependency]]?
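A sketch of what that consume case looks like in source, assuming C11 <stdatomic.h> (shared_var and array are placeholder names):
#include <stdatomic.h>

extern _Atomic int shared_var;
extern int array[];

int consume_reader(void)
{
    int n = atomic_load_explicit(&shared_var, memory_order_consume);
    n = n - n;        /* still "carries a dependency" on the loaded value */
    return array[n];  /* dependency-ordered after the load, in principle;
                         n = 0 here would break the dependency chain */
}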
But in asm for weakly-ordered ISAs, sub dst, same, same is required to still carry a dependency on the input register, just like in C. (Most weakly-ordered ISAs are RISCs with fixed-width instructions, so avoiding an immediate operand doesn't make the machine code any smaller. Thus there is no historical use of shorter zeroing idioms like sub r1, r1, r1 even on ISAs like ARM that don't have an architectural zero register. mov r1, #0 is the same size and at least as efficient as any other way. On MIPS you'd just move $v0, $zero.)
So yes, for those non-x86 ISAs, they are very different in asm. n=0 avoids any false dependency on the old value of the variable (register), while n=n-n can't execute until the old value of n is ready.
Only x86 special-cases sub same,same and xor same,same as a dependency-breaking zeroing idiom like mov eax, imm32, because mov eax, 0 is 5 bytes but xor eax,eax is only 2. So there was a long history of using this peephole optimization before out-of-order execution CPUs, and such CPUs needed to run existing code efficiently. What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains the details.
Unless you're writing by hand in x86 asm, write 0 like a normal person instead of n-n or n^n, and let the compiler use xor-zeroing as a peephole optimization.
Asm for other ISAs might have other peepholes, e.g. another answer mentions m68k. But again, if you're writing in C this is the compiler's job. Write 0 when you mean 0. Trying to "hand hold" the compiler into using an asm peephole is very unlikely to work with optimization disabled, and with optimization enabled the compiler will efficiently zero a register if it needs to.

Not sure about assembly and such, but generally,
n = 0
n = n - n
aren't always equal if n is floating point; see here:
http://www.codinghorror.com/blog/archives/001266.html
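A minimal demonstration (assuming IEEE-754 doubles):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = INFINITY;
    printf("%f\n", n - n);  /* nan: Inf - Inf is NaN, not 0 */
    n = NAN;
    printf("%f\n", n - n);  /* nan: NaN - NaN is still NaN */
    return 0;
}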

Here are some corner cases where the behavior is different for n = 0 and n = n - n:
if n has a floating-point type, the result of n - n will differ from 0 for specific values: it is NaN when n is Infinity, -Infinity, or NaN (and the sign of the resulting zero can even depend on the rounding mode);
if n is defined as volatile: the first expression will generate a single store into the corresponding memory location, while the second expression will generate two loads and a store; furthermore, if n is the location of a hardware register, the 2 loads might yield different values, causing the write to store a non-zero value;
if optimisations are disabled, the compiler might generate different code for these 2 expressions even for plain int n, which might or might not execute at the same speed.

Related

Could you use C inline assembly to align instructions? (without Compiler optimizations)

I have to do a university project where we have to use cache optimizations to improve the performance of a given code but we must not use compiler optimizations to achieve it.
One of the ideas I had reading the bibliography is to align the beginning of a basic block to a line cache size. But can you do something like:
asm(".align 64;")
for(int i = 0; i<N; i++)
... (whole basic block)
in order to achieve what I'm looking for? I have no idea if it's possible to do that in terms of instruction alignment. I've seen some trick like _mm_malloc to achieve data alignment but none for instructions. Could anyone please give me some light on the matter?
TL:DR: This might not be very useful (since modern x86 with a uop cache often doesn't care about code alignment1), but does "work" in front of a do{}while() loop, which can compile directly to asm with the same layout, without any loop setup (prologue) instructions before the actual top of the loop. (The target of the backwards branch).
In general, https://gcc.gnu.org/wiki/DontUseInlineAsm and especially never use GNU C Basic asm("foo"); inside a function, but in debug mode (the -O0 default, aka optimizations disabled) each statement (including asm();) compiles to a separate block of asm in source order. So your case doesn't actually need Extended asm(".p2align 4" ::: "memory") to order the asm statement wrt. memory operations. (Also, in recent GCC, a memory clobber is implicit for Basic asm with a non-empty template string.) At worst with optimization enabled the padding could go somewhere useless and hurt performance, but not correctness, unlike most uses of asm().
How this actually compiles
This does not exactly work; a C for loop compiles to some asm instructions before the asm loop. Especially when using a for(a;b;c) loop with some before-first-iteration initialization in statement a! You can of course pull that out in the source, but GCC's -O0 strategy for compiling while and for loops is to enter the loop with a jmp to the condition at the bottom.
But that jmp alone is only one small (2-byte) instruction, so aligning before that would put the top of the loop near the start of a possible instruction fetch block, which still gets most of the benefit if that was ever a bottleneck. (Or near the start of a new group of uop-cache lines on Sandybridge-family x86, where 32-byte boundaries are relevant. Or even a 64-byte I-cache line, although that's rarely relevant and could result in a lot of NOPs executed to reach that boundary. And bloated code size.)
void foo(register int *p)
{
// always use .p2align n or .balign 1<<n so it's unambiguous across targets like MacOS vs. Linux, never .align
asm(" .p2align 5 # from inline asm");
for (register int *endp = p + 102400; p<endp ; p++) {
*p += 123;
}
}
Compiles as follows on the Godbolt compiler explorer. Note that the way I used register meant I got not-terrible asm despite the debug build, and didn't have to combine p++ into p++ <= endp or *(p++) += 123; to make store/reload overhead less bad (because there isn't any in the first place for register locals). And I used a pointer increment / compare to keep the asm simple, and harder for debug mode to deoptimize into more wasted asm instructions.
# GCC11.3 -O0 (the default with no options, except for -masm=intel added by Godbolt)
foo:
push rbp
mov rbp, rsp
push rbx # GCC stupidly picks a call-preserved reg it has to save
mov rax, rdi
.p2align 5 # from inline asm
lea rbx, [rax+409600] # endp = p+102400
jmp .L2 # jump to the p<endp condition before the first iteration
## The actual top of the loop. 9 bytes past the alignment boundary
.L3: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx # A memory destination add dword [rax], 123 would be 2 uops for the front-end (fused-domain) on Intel, vs. 3 for 3 separate instructions.
add rax, 4 # p++
.L2:
cmp rax, rbx
jb .L3 # }while(p<endp)
nop
nop # These aren't for alignment, IDK what this is for.
mov rbx, QWORD PTR [rbp-8] # restore RBX
leave # and restore RBP / tear down stack frame
ret
This loop is 5 uops long (assuming macro-fusion of cmp/JCC), so it can run at 1 cycle per iteration on Ice Lake or Zen, if all goes well. (Load / store of 1 dword per cycle is not much memory bandwidth, so that should keep up over a large array, maybe even if it doesn't fit in L3 cache.) Or on Haswell for example, maybe 1.25 cycles per iteration, or maybe a little worse due to loop-buffer effects.
If you use "binary" output mode on Godbolt, you can see that lea rbx, [rax+409600] is a 7-byte instruction, while jmp .L2 is 2 bytes, and that the address of the top of the loop is 0x401149, i.e. 9 bytes into the 16-byte fetch-block, on CPUs that fetch in that size. I aligned by 32, so it's only wasted 2 uops out of the first uop cache line associated with this block, so we're still relatively good in terms of 32-byte blocks.
(Godbolt "binary" mode compiles and links into an executable, and runs objdump -d on that. That also lets us see the .p2align directive expanded into a NOP instruction of some width, or more than one if it had to skip more than 11 bytes, the default max NOP width for GAS for x86-64. Remember these NOP instructions have to get fetched and go through the pipeline every time control passes over this asm statement, so huge alignment inside a function is a bad thing for that as well as for I-cache footprint.)
A fairly obvious transformation gets the LEA before the .p2align. (See the asm in the Godbolt link for all of these source versions if you're curious).
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
for ( ; p < endp ; p++) {
*p += 123;
}
Or while (p < endp){... ; p++} also does the trick. The top of the asm loop becomes the following, with only a 2-byte jmp to the loop condition. So this is pretty decent, and gets most of the benefit.
lea rbx, [rax+409600]
.p2align 5 # from inline asm
jmp .L5 # 2-byte instruction
.L6:
It might be possible to achieve the same thing with for(foo=bar, asm(".p2align 4") ; p<endp ; p++). But if you're declaring a variable in the first part of a for statement, the comma operator won't work to let you sneak in a separate statement.
To actually align the asm loop, we can write it as a do{}while.
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
do {
*p += 123;
p++;
}while(p < endp);
lea rbx, [rax+409600]
.p2align 5 # from inline asm
.L8: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
cmp rax, rbx
jb .L8 # while(p<endp)
No jmp at the start, no branch-target label inside the loop. (Which is interesting if you wanted to try -falign-labels=32 to get GCC to pad for you without having it put NOPs inside the loop. See below: -falign-loops doesn't work at -O0.)
Since I'm hard-coding a non-zero size, no p == endp check runs before the first iteration. If that length was a runtime variable, e.g. a function arg, you could do if(n==0) return; before the loop. Or more generally, put the loop inside an if like GCC does when compiling a for or while loop with optimization enabled, if it can't prove that it always runs at least one iteration.
if(n!=0) {
register int *endp = p + n;
asm (".p2align 4");
do {
...
}while(p!=endp);
}
Getting GCC to do this for you: -falign-loops=16 doesn't work at -O0
GCC -O2 enables -falign-loops=16:11:8 or something like that (align by 16 if that would skip fewer than 11 bytes, otherwise align by 8). That's why GCC uses a sequence of two .p2align directives, with a padding limit on the first one (see the GAS manual).
.p2align 4,,10 # what GCC does on its own
.p2align 3
But using -falign-loops=16 does nothing at -O0. It seems GCC -O0 doesn't know what a loop is. :P
However, GCC does respect -falign-labels even at -O0. But unfortunately that applies to all labels, including the loop entry point inside the inner loop. Godbolt.
# gcc -O0 -falign-labels=16
## from compiling endp=...; asm(); while() {}
lea rbx, [rax+409600] # endp = ...
.p2align 5 # from inline asm
jmp .L5
.p2align 4 # from GCC itself, pads another 14 bytes to an odd multiple of 16 (if you didn't remove the manual .p2align 5)
.L6:
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
.p2align 4 # from GCC itself: one 5-byte NOP in this particular case
.L5:
cmp rax, rbx
jb .L6
Putting a NOP inside the inner-most loop is worse than misaligning its start on modern x86 CPUs.
You don't have this problem with a do{}while() loop, but in that case it also seems to work to use asm() to put an alignment directive there.
(I used How to remove "noise" from GCC/clang assembly output? for the compile options to minimize clutter without filtering out directives, which would include .p2align. If I just wanted to see where the inline asm went, I could have used asm("nop #hi mom") to make it visible with directives filtered out.)
If you can use inline asm but must compile with anti-optimized debug mode, there are likely major speedups from rewriting the whole inner loop in inline asm, with input/output constraints. (But don't really do that; it's hard to get right and in real life a normal person would just enable optimizations as a first step.)
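For the record, a hedged sketch of what that could look like (GNU C extended asm, x86-64, AT&T syntax; add123 is a hypothetical helper, and the loop assumes p < endp on entry). Again, don't actually do this:
void add123(int *p, int *endp)
{
    asm volatile(
        ".p2align 5          \n\t"
        "0:                  \n\t"  /* do {                       */
        "addl $123, (%[p])   \n\t"  /*   *p += 123 (memory RMW)   */
        "addq $4, %[p]       \n\t"  /*   p++                      */
        "cmpq %[end], %[p]   \n\t"
        "jb 0b               \n\t"  /* } while (p < endp)         */
        : [p] "+r" (p)
        : [end] "r" (endp)
        : "memory", "cc");
}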
Footnote 1: code alignment doesn't help much on modern x86, may help some on others
This is unlikely to be helpful even if you do actually align the target of the backwards branch (rather than just some loop prologue); modern x86 CPUs with uop caches (Sandybridge-family and Zen-family) and loop buffers (Nehalem and later for Intel) don't care very much about loop alignment.
It could help more on an older x86 CPU, or maybe for some other ISAs; only x86 is so hard to decode that uop caches are a thing. (You didn't actually specify x86, but currently most people are using x86 CPUs in their desktops/laptops, so I'm assuming that.)
The main reason alignment of branch targets helps (especially tops of loops), is when the CPU fetches a 16-byte-aligned block that includes the target address, most of the machine code in that block will be after it, and thus part of loop body that's about to run another iteration. (Bytes before the branch target are wasted in that fetch cycle).
But the worst case of mis-alignment (barring other weird effects) just costs you 1 extra cycle of front-end fetch to get more instructions in the loop body. (e.g. if the top of the loop had an address ending with 0xf, so it was the last byte of a 16-byte block, the aligned 16-byte block containing that byte would only contain that one useful byte at the end.) That might be a one-byte instruction like cdq, but pipelines are often 4 instructions wide, or more.
(Or 3-wide in the early Intel P6-family days, before there were buffers between fetch, pre-decode (length finding) and decode. Buffering can hide bubbles if the rest of the loop decodes efficiently and the average instruction length is short. But decode was still a significant bottleneck until Nehalem's loop buffer could recycle the decode results (uops) for a small loop (a couple dozen uops). And Sandybridge-family added a uop cache to cache large loops that include multiple functions that get called frequently. David Kanter's deep-dive on SnB has nice block diagrams, and see also https://www.agner.org/optimize/, especially Agner's microarch pdf.)
Even then, it only helps at all when front-end (instruction fetch/decode) bandwidth is a problem, not some back-end bottleneck (actually executing those instructions). Out-of-order exec usually does a pretty good job of letting the CPU run as fast as the slowest bottleneck, not waiting until after a cache-miss load to get later instructions fetched and decoded. (See this, this, and especially Modern Microprocessors A 90-Minute Guide!.)
There are cases where it could help on a Skylake CPU where a microcode update disabled the loop buffer (LSD), so a tiny loop body split across a 32-byte boundary can run at best 1 iteration per 2 cycles (fetching uops from 2 separate cache lines). Or on Skylake again, tweaking code alignment this way could help avoid the JCC erratum (that can make part of your code run from legacy decode instead of the uop cache) if you can't pass -Wa,-mbranches-within-32B-boundaries to get the assembler to work around it. (How can I mitigate the impact of the Intel jcc erratum on gcc?). These problems are specific to Skylake-derived microarchitectures, and were fixed in Ice Lake.
Of course, anti-optimized debug-mode code is so bloated that even a tight loop is unlikely to be fewer than 8 uops anyway, so the 32-byte-boundary problem probably doesn't hurt much. But if you manage to avoid store/reload latency bottlenecks by using register on local vars (yes, this does something in debug builds only, otherwise it's meaningless; see footnote 2), the front-end bottleneck of getting all those inefficient instructions through the pipeline could well be impacted on a Skylake CPU if an inner loop ends up tripping over the JCC erratum due to where a conditional branch inside or at the bottom of the loop ends up.
Anyway, as Eric commented, your assignment is likely more about data access pattern, and possibly layout and alignment. Presumably involving a smallish loop over some large amounts of memory, since L2 or L3 cache misses are the only thing that would be slow enough to be more of a bottleneck than building with optimization disabled. Maybe L1d in some cases, if you manage to get a compiler to make non-terrible asm for debug mode, or if load-use latency (not just throughput) is part of the critical path.
Footnote 2: -O0 is dumb, but register int i can help
See
C loop optimization help for final assignment (with compiler optimization disabled) re: how silly it is to optimize source code for debug mode, or benchmark that way for normal use-cases. But also mentions some things that are faster for that case (unlike normal builds) like doing more in a single statement or expression, since the compiler doesn't keep things in registers across statements.
(See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for details)
Except register variables; that obsolete keyword does still do something in unoptimized builds with GCC (but not clang). It's officially deprecated or even removed in recent C++ versions, but not in C as yet.
You definitely want to use register int i to let a debug build keep it in a register, and write your C like it was hand-written asm. For example, using pointer increments instead of arr[i] where appropriate, especially for ISAs that don't have an indexed addressing mode.
register variables are most important inside your inner loop, and with optimization disabled the compiler probably isn't very smart about deciding which register var actually gets a register if it runs out. (x86-64 has 15 integer regs other than the stack pointer, and a debug build will spend one of them on a frame pointer.)
Especially for variables that change inside loops, to avoid store/reload latency bottlenecks, e.g. for(register int i=1000000 ; --i ; ); probably runs 1 iteration per clock, vs. 5 or 6 without register on a modern x86-64 CPU like Skylake.
If using an integer variable as an array index, make it intptr_t or uintptr_t (#include <stdint.h>) if possible, so the compiler doesn't have to redo sign-extension from 32-bit int to 64-bit pointer width for use in addressing modes.
(Unless you're compiling for AArch64, which has addressing modes that take a 64-bit register and a 32-bit register, doing sign or zero extension and ignoring high garbage in the narrow integer reg. Exactly because this is something compilers can't always optimize away. Although often they can thanks to signed-integer overflow being Undefined Behaviour allowing the compiler to widen an integer loop variable or convert to a pointer increment.)
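For instance, a small sketch of that indexing advice:
#include <stdint.h>

void scale(float *arr, intptr_t n)
{
    /* pointer-width index: no sign extension needed in the addressing mode */
    for (register intptr_t i = 0; i < n; i++)
        arr[i] *= 2.0f;
}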
Also loosely related: Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs has a section on intentionally making things slow via cache effects, so do the opposite of that. Might not be very applicable, IDK what your problem is like.

Segmentation fault when attempting to print int value from x86 external function [duplicate]

I've noticed that a lot of calling conventions insist that [e]bx be preserved for the callee.
Now, I can understand why they'd preserve something like [e]sp or [e]bp, since that can mess up the callee's stack. I can also understand why you might want to preserve [e]si or [e]di since that can break the callee's string instructions if they aren't particularly careful.
But [e]bx? What on earth is so important about [e]bx? What makes [e]bx so special that multiple calling conventions insist that it be preserved throughout function calls?
Is there some sort of subtle bug/gotcha that can arise from messing with [e]bx?
Does modifying [e]bx somehow have a greater impact on the callee than modifying [e]dx or [e]cx for instance?
I just don't understand why so many calling conventions single out [e]bx for preservation.
Not all registers make good candidates for preserving:
no (e)ax -- Implicitly used in some instructions; Return value
no (e)dx -- edx:eax is implicitly used in cdq, div, mul and in return values
(e)bx -- generic register, usable in 16-bit addressing modes (base)
(e)cx -- shift-counts, used in loop, rep
(e)si -- movs operations, usable in 16-bit addressing modes (index)
(e)di -- movs operations, usable in 16-bit addressing modes (index)
Must (e)bp -- frame pointer, usable in 16-bit addressing modes (base)
Must (e)sp -- stack pointer, not addressable in 8086 (other than push/pop)
Looking at the table, two registers have good reason to be preserved and two have a reason not to be preserved. The accumulator (e)ax, for example, is the most often used register due to its short encodings. SI and DI make a logical register pair: on REP MOVS and other string operations, both are trashed.
In a half-and-half callee/caller-saved scheme, the discussion would basically come down to whether bx/cx is preferred over si/di. In other calling conventions, it's just EDX, EAX and ECX that can be trashed.
EBX does have a few obscure implicit uses that are still relevant in modern code (e.g. CMPXCHG8B / CMPXCHG16B), but it's the least special register in 32/64-bit code.
EBX makes a good choice for a call-preserved register because it's rare that a function will need to save/restore EBX because they need EBX specifically, and not just any non-volatile register. As Brett Hale's answer points out, it makes EBX a great choice for the global offset table (GOT) pointer in ABIs that need one.
In 16-bit mode, addressing modes were limited to (any subset of) [BP|BX + DI|SI + disp8/disp16]), so BX is definitely special there.
This is a compromise between not saving any of the registers and saving them all. Either saving none, or saving all, could have been proposed, but either extreme leads to inefficiencies caused by copying the contents to memory (the stack). Choosing to allow some registers to be preserved and some not, reduces the average cost of a function call.
One of the main reasons, certainly for the i386 ELF ABI, is that ebx holds the address of the global offset table (GOT) register for position-independent code (PIC). See 3-35 of the specification for the details. It would be disruptive in the extreme, if, say, shared library code had to restore the GOT after every function call return.

Why is there no inbuilt swap function in C but there is xchg in Assembly?

Recently I came across Assembly language. x86 assembly has an xchg instruction which swaps the contents of two registers.
Since every C program is first converted to assembly, it would have been nice if there were a swap function built into C, like those in the header stdio.h. Then whenever the compiler detected the swap function, it could emit the xchg instruction in the assembly file.
So why was such a swap function not implemented in C?
C is a cross-platform language; assembly is architecture-specific. Not every architecture has such an instruction. Moreover, C, as a high-level language, doesn't have to correspond to the machine-level instruction set and features, as its purpose is to bridge between the "human" language and the machine language, not to mimic it. That said, a C compiler for a specific architecture might offer an extension for this swapping instruction, or, if it is smart enough, optimize swapping code to use it.
There are two points which can explain why swap() is not in C
1. Function call semantics:
Including a swap() function would break a very fundamental design decision in C: swap() can only work with pass-by-reference semantics (which C++ added to the language, but which are absent in C), not with pass-by-value.
2. Diversity of available assembler instructions
Apart from that, there is usually quite a number of assembler instructions on any given CPU architecture which are totally inaccessible from pure C. This includes instructions as diverse as interrupt handling instructions, virtual memory space manipulating instructions, I/O instructions, bit fiddling instructions (google the PPC instruction rlwimi for an especially powerful example of this), etc.
It is simply impossible to include any significant number of these in a general purpose language like C.
Some of these are crucial for implementing operating systems, which is why any OS must include at the very least some small amounts of assembler code. They are usually encapsulated in some functions with inline assembler or defined in the kernel headers as preprocessor directives. Other instructions are less important, or only good for optimizations, these may be generated by optimizing compilers, and many compilers do generate them (the whole class of vector functions fall in this category).
In the face of this vast diversity, the designers of C just had to cut it somewhere. And they opted for providing whatever is representable as simple operators like (+, -, ~, &, |, !, &&, ||, etc.), but did not provide anything that would require function call syntax like the swap() function you propose.
That would only work for variables that fit in a register and currently live in registers. It would not work for large structs or for variables held in memory. (And if you load a variable A into reg X and another, say B, into reg Y and then swap them, you could skip the swap and simply load A into Y and B into X directly.)
Having said that, nothing prevents the compiler for a given architecture from using a swap instruction to compile:
int a;
int b;
int tmp;
tmp=a;
a=b;
b=tmp;
... if those happen to be in registers: the fact that it is not expressible in C does not mean the compiler does not use it.
Besides what the other correct answers say, another part of your premise is wrong.
Only a really dumb compiler would want to actually emit xchg every time the source swapped variables, whether there's an intrinsic or operator for it or not. Optimizing compilers don't just transliterate C into asm, they typically convert to an SSA internal representation of the program logic, and optimize that so they can implement it with as few instructions as possible (or really in the most efficient way possible; using multiple fast instructions can be better than a single slower one).
xchg is rarely faster than 3 mov instructions, and a good compiler can simply change its local-variable <-> CPU-register mapping without emitting any asm instructions in many cases. (Or inside a loop, unrolling can often optimize away swapping.) Often you need only 1 or 2 mov instructions in asm, not all 3. e.g. if only one of the C vars being swapped needs to stay in the same register, you can do:
# start: x in EAX, y in ECX
mov edx, eax
mov eax, ecx
# end: y in EAX, x in EDX
See also Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?
Also note that xchg [mem], reg is atomic (implicit lock prefix), and thus is a full memory barrier, and much slower than 3 mov instructions, and with much higher impact on surrounding code because of the memory-barrier effect.
If you do actually need to exchange registers, 3x mov is pretty good. Often better than xchg reg,reg because of mov elimination, at the cost of more code-size and a tmp reg.
There's a reason compilers never use xchg. If xchg was a win, compilers would look for it as a peephole optimization the same way they look for inc eax over add eax,1, or xor eax,eax instead of mov eax,0. But they don't.
(semi-related: swapping 2 registers in 8086 assembly language(16 bits))
Even though xchg is a very elementary instruction, this doesn't mean C must have its equivalent. The fact that C sometimes maps directly to assembly is not very relevant; the standard says nothing about "assembly" (why map to assembly and not another low-level language?).
You might also ask: Why does C not have built-in vector instructions? They're becoming largely available!
There's also compiler's help: swapping variables is a very visible pattern, so such optimization shouldn't be hard to implement. And you also have inline asm, should you need it.
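And if you really wanted the instruction, a hedged sketch using GNU C inline asm on x86 (swap_int is a hypothetical helper) might look like this, though it buys nothing over the plain three-assignment version:
static inline void swap_int(int *a, int *b)
{
    int x = *a, y = *b;
    asm("xchg %0, %1" : "+r" (x), "+r" (y));  /* registers now swapped */
    *a = x;  /* x holds the old *b */
    *b = y;  /* y holds the old *a */
}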

Compare and swap in machine code in C

How would you write a function in C which does an atomic compare and swap on an integer value, using embedded machine code (assuming, say, x86 architecture)? Can it be any more specific if it's written only for the i7 processor?
Does the instruction act as a memory fence, or does it just ensure ordering on the memory location involved in the compare and swap? How costly is it compared to a memory fence?
Thank you.
The easiest way to do it is probably with a compiler intrinsic like _InterlockedCompareExchange(). It looks like a function but is actually a special case in the compiler that boils down to a single machine op. In the case of the MSVC x86 intrinsic, that works as a read/write fence as well, but that's not necessarily true on other platforms. (For example, on the PowerPC, you'd need to explicitly issue a lwsync to fence memory reordering.)
In general, on many common systems, a compare-and-swap operation usually only enforces an atomic transaction upon the one address it's touching. Other memory access can be reordered, and in multicore systems, memory addresses other than the one you've swapped may not be coherent between the cores.
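These days the portable way to spell this is C11 <stdatomic.h>; a minimal sketch of the same operation (cas_int is a hypothetical wrapper):
#include <stdatomic.h>
#include <stdbool.h>

bool cas_int(_Atomic int *obj, int expected, int desired)
{
    /* Defaults to memory_order_seq_cst: an atomic RMW that is also a
       full barrier, like lock cmpxchg on x86. */
    return atomic_compare_exchange_strong(obj, &expected, desired);
}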
You can use the CMPXCHG instruction with the LOCK prefix for atomic execution.
E.g.
lock cmpxchg DWORD PTR [ebx], edx
or
lock cmpxchgl %edx, (%ebx)
This compares the value in the EAX register with the value at the address stored in the EBX register and stores the value in the EDX register to that location if they are the same, otherwise it loads the value at the address stored in the EBX register into EAX.
You need to have a 486 or later for this instruction to be available.
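Wrapped in GNU C inline asm, the sequence above might look like this hedged sketch (32-bit operands; in real code prefer the compiler's atomic builtins or C11 atomics):
static inline int cas32(volatile int *ptr, int oldval, int newval)
{
    int prev;
    asm volatile("lock cmpxchgl %2, %1"
                 : "=a" (prev), "+m" (*ptr)  /* old value comes back in EAX */
                 : "r" (newval), "0" (oldval)
                 : "memory");
    return prev;  /* equals oldval exactly when the swap happened */
}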
If your integer value is 64-bit, then use cmpxchg8b, the 8-byte compare and exchange, under IA-32 x86.
The variable must be 8-byte aligned.
Example:
mov eax, OldDataA //load Old first 32 bits
mov edx, OldDataB //load Old second 32 bits
mov ebx, NewDataA //load first 32 bits
mov ecx, NewDataB //load second 32 bits
mov edi, Destination //load destination pointer
lock cmpxchg8b qword ptr [edi]
setz al //if the exchange was successful, AL is 1, else 0
If the LOCK prefix is omitted from atomic processor instructions, atomicity across a multiprocessor environment is not guaranteed.
In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted. Intel Instruction Set Reference
Without the LOCK prefix, the operation is only guaranteed not to be interrupted by any event (interrupt) on the current processor/core.
It's interesting to note that some processors don't provide a compare-exchange, but instead provide some other instructions ("Load Linked" and "Conditional Store") that can be used to synthesize the unfortunately-named compare-and-swap (the name sounds like it should be similar to "compare-exchange" but should really be called "compare-and-store" since it does the comparison, stores if the value matches, and indicates whether the value matched and the store was performed). The instructions cannot synthesize compare-exchange semantics (which provides the value that was read in case the compare failed), but may in some cases avoid the ABA problem which is present with Compare-Exchange. Many algorithms are described in terms of "CAS" operations because they can be used on both styles of CPU.
A "Load Linked" instruction tells the processor to read a memory location and watch in some way to see if it might be written. A "Conditional Store" instruction instructs the processor to write a memory location only if nothing can have written it since the last "Load Linked" operation. Note that the determination may be pessimistic; processing an interrupt, for example, may invalidate a "Load-Linked"/"Conditional Store" sequence. Likewise, in a multi-processor system, an LL/CS sequence may be invalidated by another CPU accessing a location on the same cache line as the location being watched, even if the actual location being watched wasn't touched. In typical usage, LL/CS are used very close together, with a retry loop, so that erroneous invalidations may slow things down a little but won't cause much trouble.
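In C11 terms, the weak compare-exchange maps naturally onto a single LL/SC pair, with the retry loop written in source (a hedged sketch):
#include <stdatomic.h>

void atomic_add_via_cas(_Atomic int *obj, int delta)
{
    int expected = atomic_load(obj);
    /* On LL/SC machines the "weak" form may fail spuriously; that's fine
       here, since on failure expected is refreshed with the current value. */
    while (!atomic_compare_exchange_weak(obj, &expected, expected + delta))
        ;
}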

"register" keyword in C?

What does the register keyword do in C language? I have read that it is used for optimizing but is not clearly defined in any standard. Is it still relevant and if so, when would you use it?
It's a hint to the compiler that the variable will be heavily used and that you recommend it be kept in a processor register if possible.
Most modern compilers do that automatically, and are better at picking them than us humans.
I'm surprised that nobody has mentioned that you cannot take the address of a register variable, even if the compiler decides to keep the variable in memory rather than in a register.
So by using register you win nothing (the compiler will decide for itself where to put the variable anyway) and you lose the & operator; there is no reason to use it.
It tells the compiler to try to use a CPU register, instead of RAM, to store the variable. Registers are in the CPU and much faster to access than RAM. But it's only a suggestion to the compiler, and it may not follow through.
I know this question is about C, but the same question for C++ was closed as an exact duplicate of this question. This answer therefore may not apply to C.
The latest draft of the C++11 standard, N3485, says this in 7.1.1/3:
A register specifier is a hint to the implementation that the variable so declared will be heavily used. [ note: The hint can be ignored and in most implementations it will be ignored if the address of the variable is taken. This use is deprecated ... —end note ]
In C++ (but not in C), the standard does not state that you can't take the address of a variable declared register; however, because a variable stored in a CPU register throughout its lifetime does not have a memory location associated with it, attempting to take its address would be invalid, and the compiler will ignore the register keyword to allow taking the address.
I have read that it is used for optimizing but is not clearly defined in any standard.
In fact it is clearly defined by the C standard. Quoting the N1570 draft section 6.7.1 paragraph 6 (other versions have the same wording):
A declaration of an identifier for an object with storage-class
specifier register suggests that access to the object be as fast
as possible. The extent to which such suggestions are effective is
implementation-defined.
The unary & operator may not be applied to an object defined with register, and register may not be used in an external declaration.
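For example, this constraint is enforced at compile time:
void f(void)
{
    register int r = 0;
    /* int *p = &r;  -- constraint violation; GCC and Clang report
       "address of register variable requested" */
    (void)r;  /* silence unused-variable warnings */
}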
There are a few other (fairly obscure) rules that are specific to register-qualified objects:
Defining an array object with register has undefined behavior.
Correction: It's legal to define an array object with register, but you can't do anything useful with such an object (indexing into an array requires taking the address of its initial element).
The _Alignas specifier (new in C11) may not be applied to such an object.
If the parameter name passed to the va_start macro is register-qualified, the behavior is undefined.
There may be a few others; download a draft of the standard and search for "register" if you're interested.
As the name implies, the original meaning of register was to require an object to be stored in a CPU register. But with improvements in optimizing compilers, this has become less useful. Modern versions of the C standard don't refer to CPU registers, because they no longer (need to) assume that there is such a thing (there are architectures that don't use registers). The common wisdom is that applying register to an object declaration is more likely to worsen the generated code, because it interferes with the compiler's own register allocation. There might still be a few cases where it's useful (say, if you really do know how often a variable will be accessed, and your knowledge is better than what a modern optimizing compiler can figure out).
The main tangible effect of register is that it prevents any attempt to take an object's address. This isn't particularly useful as an optimization hint, since it can be applied only to local variables, and an optimizing compiler can see for itself that such an object's address isn't taken.
It hasn't been relevant for at least 15 years as optimizers make better decisions about this than you can. Even when it was relevant, it made a lot more sense on a CPU architecture with a lot of registers, like SPARC or M68000 than it did on Intel with its paucity of registers, most of which are reserved by the compiler for its own purposes.
Actually, register tells the compiler that the variable does not alias with anything else in the program (not even chars).
That can be exploited by modern compilers in a variety of situations, and can help the compiler quite a bit in complex code - in simple code the compilers can figure this out on their own.
Otherwise, it serves no purpose and is not used for register allocation. It does not usually incur performance degradation to specify it, as long as your compiler is modern enough.
Storytime!
C, as a language, is an abstraction of a computer. It allows you to do things in terms of what a computer does: manipulate memory, do math, print things, etc.
But C is only an abstraction. And ultimately, what it's abstracting away is assembly language. Assembly is the language that a CPU reads, and if you use it, you do things in terms of the CPU. What does a CPU do? Basically, it reads from memory, does math, and writes to memory. The CPU doesn't just do math on numbers sitting in memory. First, you have to move a number from memory into a small piece of storage inside the CPU called a register. Once you're done doing whatever you need to do to this number, you can move it back to normal system memory. Why use system memory at all? Registers are limited in number. You only get about a hundred bytes' worth in modern processors, and older popular processors were even more fantastically limited (the 6502 had 3 8-bit registers for your free use). So, your average math operation looks like:
load first number from memory
load second number from memory
add the two
store answer into memory
A lot of that is... not math. Those load and store operations can take up to half your processing time. C, being an abstraction of computers, freed the programmer from the worry of using and juggling registers, and since the number and type vary between computers, C places the responsibility of register allocation solely on the compiler. With one exception.
When you declare a variable register, you are telling the compiler "Yo, I intend for this variable to be used a lot and/or be short lived. If I were you, I'd try to keep it in a register." When the C standard says compilers don't have to actually do anything, that's because the C standard doesn't know what computer you're compiling for, and it might be like the 6502 above, where all 3 registers are needed just to operate, and there's no spare register to keep your number. However, when it says you can't take the address, that's because registers don't have addresses. They're the processor's hands. Since the compiler doesn't have to give you an address, and since it can't have an address at all ever, several optimizations are now open to the compiler. It could, say, keep the number in a register always. It doesn't have to worry about where it's stored in computer memory (beyond needing to get it back again). It could even pun it into another variable, give it to another processor, give it a changing location, etc.
tl;dr: Short-lived variables that do lots of math. Don't declare too many at once.
You are messing with the compiler's sophisticated graph-coloring algorithm, which is used for register allocation. Well, mostly. It acts as a hint to the compiler -- that's true. But it is not ignored entirely, since you are not allowed to take the address of a register variable, and that restriction leaves the compiler free to treat the variable differently. Which in a way is telling you not to use it.
The keyword was used long, long ago, when there were so few registers that you could count them all on your index finger.
But, as I said, deprecated doesn't mean you cannot use it.
Just a little demo (without any real-world purpose) for comparison: when the register keywords before each variable are removed, this piece of code takes 3.41 seconds on my i7 (GCC); with register, the same code completes in 0.7 seconds.
#include <stdio.h>

int main(int argc, char** argv) {
    register int numIterations = 20000;
    register int i = 0;
    unsigned long val = 0;
    for (; i < numIterations + 1; i++)
    {
        register int j = 0;
        for (; j < i; j++)
        {
            val = j + i;
        }
    }
    printf("%lu", val);  /* %lu: val is unsigned long (the original %d was a type mismatch) */
    return 0;
}
I have tested the register keyword under QNX 6.5.0 using the following code:
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <sys/neutrino.h>
#include <sys/syspage.h>

int main(int argc, char *argv[]) {
    uint64_t cps, cycle1, cycle2, ncycles;
    double sec;
    register int a = 0, b = 1, c = 3, i;

    cycle1 = ClockCycles();
    for (i = 0; i < 100000000; i++)
        a = ((a + b + c) * c) / 2;
    cycle2 = ClockCycles();

    ncycles = cycle2 - cycle1;
    printf("%" PRIu64 " cycles elapsed\n", ncycles);  /* PRIu64 matches uint64_t */

    cps = SYSPAGE_ENTRY(qtime)->cycles_per_sec;
    printf("This system has %" PRIu64 " cycles per second\n", cps);

    sec = (double)ncycles / cps;
    printf("The cycles in seconds is %f\n", sec);

    return EXIT_SUCCESS;
}
I got the following results:
-> 807679611 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.244600
And now without register int:
int a=0, b = 1, c = 3, i;
I got:
-> 1421694077 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.430700
During the seventies, at the very beginning of the C language, the register keyword was introduced to allow the programmer to give hints to the compiler, telling it that the variable would be used very often and that it would be wise to keep its value in one of the processor's internal registers.
Nowadays, optimizers are much more efficient than programmers to determine variables that are more likely to be kept into registers, and the optimizer does not always take the programmer’s hint into account.
So many people wrongly recommend not to use the register keyword.
Let’s see why!
The register keyword has an associated side effect: you can not reference (get the address of) a register type variable.
People advising others not to use register wrongly take this as an additional argument against it.
However, the simple fact that you cannot take the address of a register variable allows the compiler (and its optimizer) to know that the value of this variable cannot be modified indirectly through a pointer.
When, at a certain point of the instruction stream, a register variable has its value assigned in a processor register, and that register has not been used since then to hold the value of another variable, the compiler knows that it does not need to re-load the value of the variable into that register.
This avoids expensive, useless memory accesses.
Do your own tests and you will get significant performance improvements in your innermost loops.
register would notify the compiler that the coder believed this variable would be written/read enough to justify its storage in one of the few registers available for variable use. Reading/writing registers is usually faster, and the instructions involved can have shorter encodings.
Nowadays, this isn't very useful, as most compilers' optimizers are better than you at determining whether a register should be used for that variable, and for how long.
gcc 9.3 asm output, without using optimisation flags (everything in this answer refers to standard compilation without optimisation flags):
#include <stdio.h>
int main(void) {
int i = 3;
i++;
printf("%d", i);
return 0;
}
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
add DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
#include <stdio.h>
int main(void) {
register int i = 3;
i++;
printf("%d", i);
return 0;
}
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
push rbx
sub rsp, 8
mov ebx, 3
add ebx, 1
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
add rsp, 8
pop rbx
pop rbp
ret
This forces ebx to be used for the calculation, meaning it needs to be pushed to the stack and restored at the end of the function because it is callee-saved. register produces more lines of code and 1 memory write and 1 memory read (although realistically, this could have been optimised to 0 reads/writes if the calculation had been done in esi, which is what happens using C++'s const register). Not using register causes 2 writes and 1 read (although store-to-load forwarding will occur on the read). This is because the value has to be present and updated directly on the stack so the correct value can be read by address (pointer); register doesn't have this requirement and cannot be pointed to.
const and register are basically the opposite of volatile: using volatile will override the const optimisations at file and block scope and the register optimisations at block scope. const register and register will produce identical outputs because const does nothing in C at block scope, so only the register optimisations apply.
On clang, register is ignored but const optimisations still occur.
On supporting C compilers, it tries to optimize the code so that the variable's value is held in an actual processor register.
Microsoft's Visual C++ compiler ignores the register keyword when global register-allocation optimization (the /Oe compiler flag) is enabled.
See register Keyword on MSDN.
The register keyword tells the compiler to store the particular variable in a CPU register so that it can be accessed fast. From a programmer's point of view, the register keyword is used for variables which are heavily used in a program, so that the compiler can speed up the code. Ultimately, though, it is up to the compiler whether to keep the variable in a CPU register or in main memory.
register indicates to the compiler that it should optimize the code by storing that particular variable in a register rather than in memory. It is a request to the compiler; the compiler may or may not honor it.
You can use this facility where some of your variables are accessed very frequently, for example in a loop.
One more thing: if you declare a variable as register, you can't take its address, as it is not (notionally) stored in memory; it gets its allocation in a CPU register.
The register keyword is a request to the compiler that the specified variable is to be stored in a register of the processor instead of memory as a way to gain speed, mostly because it will be heavily used. The compiler may ignore the request.
