Why is there no inbuilt swap function in C but there is xchg in Assembly?

Recently I came across assembly language. x86 assembly has an xchg instruction which swaps the contents of two registers.
Since every C program is first converted to assembly, it would have been nice if there were a swap function built into C, in a header like stdio.h. Then whenever the compiler detected a call to swap, it could emit the xchg instruction in the assembly file.
So why was a swap function never implemented in C?

C is a cross-platform language; assembly is architecture-specific. Not every architecture has such an instruction. Moreover, C, as a high-level language, doesn't have to correspond to the machine-level instruction set and features; its purpose is to bridge between "human" language and machine language, not to mimic it. That said, a C compiler for a specific architecture might offer an extension for this swapping instruction, or might optimize the swapping code to use it if it is smart enough.

There are two points which explain why swap() is not in C:
1. Function call semantics:
Including a swap() function would break a very fundamental design decision in C: swap() can only work with pass-by-reference semantics (which C++ added to the language, but which are absent in C), not with pass-by-value (the sketch after this answer illustrates why).
2. Diversity of available assembler instructions
Apart from that, there is usually quite a number of assembler instructions on any given CPU architecture which are totally inaccessible from pure C. This includes instructions as diverse as interrupt handling instructions, virtual memory space manipulating instructions, I/O instructions, bit fiddling instructions (google the PPC instruction rlwimi for an especially powerful example of this), etc.
It is simply impossible to include any significant number of these in a general purpose language like C.
Some of these are crucial for implementing operating systems, which is why any OS must include at least some small amount of assembler code. They are usually encapsulated in functions with inline assembler, or defined in the kernel headers as preprocessor directives. Other instructions are less important, or only good for optimizations; these may be generated by optimizing compilers, and many compilers do generate them (the whole class of vector instructions falls into this category).
In the face of this vast diversity, the designers of C had to cut it somewhere. And they opted to provide whatever is representable as simple operators (+, -, ~, &, |, !, &&, ||, etc.), but not anything that would require function-call syntax, like the swap() you propose.
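To make point 1 concrete, here is a minimal sketch (swap_broken and swap_ptr are illustrative names, not from any standard header): a by-value swap only exchanges the callee's local copies, which is why the standard C idiom passes pointers instead.

#include <stdio.h>

void swap_broken(int a, int b) { /* a and b are copies of the arguments */
    int tmp = a;
    a = b;
    b = tmp;                     /* only the local copies are swapped */
}

void swap_ptr(int *a, int *b) {  /* the usual C idiom: pass addresses */
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

int main(void) {
    int x = 1, y = 2;
    swap_broken(x, y);
    printf("%d %d\n", x, y);     /* prints "1 2": x and y are unchanged */
    swap_ptr(&x, &y);
    printf("%d %d\n", x, y);     /* prints "2 1" */
    return 0;
}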

That would work for variables that fit in a register and are already in registers. It would not work for large structs or for variables held in memory. (And if you load a variable A into register X and another, say B, into register Y, and then swap them, you could just skip the swap and load A into Y and B into X directly.)
Having said that, nothing prevents the compiler for a given architecture from using the swap instruction to compile:
int a;
int b;
int tmp;
tmp=a;
a=b;
b=tmp;
... if those happen to be in registers: the fact that the operation is not in C does not mean the compiler does not use it.

Besides what the other correct answers say, another part of your premise is wrong.
Only a really dumb compiler would want to actually emit xchg every time the source swapped variables, whether there's an intrinsic or operator for it or not. Optimizing compilers don't just transliterate C into asm; they typically convert to an SSA internal representation of the program logic and optimize that, so they can implement it with as few instructions as possible (or really, in the most efficient way possible; using multiple fast instructions can be better than a single slower one).
xchg is rarely faster than 3 mov instructions, and a good compiler can simply change its local-variable <-> CPU-register mapping without emitting any asm instructions in many cases. (Or inside a loop, unrolling can often optimize away swapping.) Often you need only 1 or 2 mov instructions in asm, not all 3. e.g. if only one of the C vars being swapped needs to stay in the same register, you can do:
# start: x in EAX, y in ECX
mov edx, eax
mov eax, ecx
# end: y in EAX, x in EDX
See also Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?
Also note that xchg [mem], reg is atomic (implicit lock prefix), and thus is a full memory barrier, and much slower than 3 mov instructions, and with much higher impact on surrounding code because of the memory-barrier effect.
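(For what it's worth, an atomic exchange is the one case where compilers do routinely emit xchg [mem], reg. A minimal sketch using C11's <stdatomic.h>; try_take is an illustrative name, and x86 is assumed as the target:)

#include <stdatomic.h>

atomic_int lock = 0;

int try_take(void) {
    /* On x86 this typically compiles to xchg [lock], reg: an atomic
       read-modify-write with full-barrier semantics. It returns the old
       value, so 0 means this hypothetical lock was acquired. */
    return atomic_exchange(&lock, 1);
}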
If you do actually need to exchange registers, 3x mov is pretty good. Often better than xchg reg,reg because of mov elimination, at the cost of more code-size and a tmp reg.
There's a reason compilers never use xchg: if xchg were a win, compilers would look for it as a peephole optimization the same way they look for inc eax over add eax,1, or xor eax,eax instead of mov eax,0. But they don't.
(semi-related: swapping 2 registers in 8086 assembly language(16 bits))

Even though xchg is a very elementary instruction, this doesn't mean C must have its equivalent. The fact that C sometimes maps directly to assembly is not very relevant; the standard says nothing about "assembly" (why map to assembly and not another low-level language?).
You might also ask: Why does C not have built-in vector instructions? They're becoming largely available!
There's also the compiler's help: swapping variables is a very visible pattern, so such an optimization shouldn't be hard to implement. And you also have inline asm, should you need it.
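If you really do want xchg, here is a hedged sketch using GNU C extended inline asm (x86 assumed; swap_xchg is an illustrative name, and this is usually slower than just letting the compiler rename registers):

static inline void swap_xchg(int *a, int *b) {
    /* "+r" marks each operand as read-write: the compiler loads *a and
       *b into registers, the instruction swaps them, and the swapped
       values are stored back. */
    __asm__("xchg %0, %1" : "+r"(*a), "+r"(*b));
}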


Why do compilers insist on using a callee-saved register here?

Consider this C code:
void foo(void);
long bar(long x) {
    foo();
    return x;
}
When I compile it on GCC 9.3 with either -O3 or -Os, I get this:
bar:
        push    r12
        mov     r12, rdi
        call    foo
        mov     rax, r12
        pop     r12
        ret
The output from clang is identical except for choosing rbx instead of r12 as the callee-saved register.
However, I want/expect to see assembly that looks more like this:
bar:
        push    rdi
        call    foo
        pop     rax
        ret
Since you have to push something to the stack anyway, it seems shorter, simpler, and probably faster to just push your value there, instead of pushing some arbitrary callee-saved register's value there and then storing your value in that register. Ditto for the inverse after call foo when you're putting things back.
Is my assembly wrong? Is it somehow less efficient than messing with an extra register? If the answer to both of those are "no", then why don't either GCC or clang do it this way?
Godbolt link.
Edit: Here's a less trivial example, to show it happens even if the variable is meaningfully used:
long foo(long);
long bar(long x) {
    return foo(x * x) - x;
}
I get this:
bar:
        push    rbx
        mov     rbx, rdi
        imul    rdi, rdi
        call    foo
        sub     rax, rbx
        pop     rbx
        ret
I'd rather have this:
bar:
        push    rdi
        imul    rdi, rdi
        call    foo
        pop     rdi
        sub     rax, rdi
        ret
This time, it's only one instruction off vs. two, but the core concept is the same.
Godbolt link.
TL:DR:
Compiler internals are probably not set up to look for this optimization easily, and it's probably only useful around small functions, not inside large functions between calls.
Inlining to create large functions is a better solution most of the time
There can be a latency vs. throughput tradeoff if foo happens not to save/restore RBX.
Compilers are complex pieces of machinery. They're not "smart" like a human, and expensive algorithms to find every possible optimization are often not worth the cost in extra compile time.
I reported this as GCC bug 69986 - smaller code possible with -Os by using push/pop to spill/reload back in 2016; there's been no activity or replies from GCC devs. :/
Slightly related: GCC bug 70408 - reusing the same call-preserved register would give smaller code in some cases - compiler devs told me it would take a huge amount of work for GCC to be able to do that optimization because it requires picking order of evaluation of two foo(int) calls based on what would make the target asm simpler.
If foo doesn't save/restore rbx itself, there's a tradeoff between throughput (instruction count) vs. an extra store/reload latency on the x -> retval dependency chain.
Compilers usually favour latency over throughput, e.g. using 2x LEA instead of imul reg, reg, 10 (3-cycle latency, 1/clock throughput), because most code averages significantly less than 4 uops / clock on typical 4-wide pipelines like Skylake. (More instructions/uops do take more space in the ROB, reducing how far ahead the same out-of-order window can see, though, and execution is actually bursty with stalls probably accounting for some of the less-than-4 uops/clock average.)
If foo does push/pop RBX, then there's not much to gain for latency. Having the restore happen just before the ret instead of just after is probably not relevant, unless there's a ret mispredict or I-cache miss that delays fetching code at the return address.
Most non-trivial functions will save/restore RBX, so it's often not a good assumption that leaving a variable in RBX will actually mean it truly stayed in a register across the call. (Although randomizing which call-preserved registers functions choose might be a good idea to mitigate this sometimes.)
So yes push rdi / pop rax would be more efficient in this case, and this is probably a missed optimization for tiny non-leaf functions, depending on what foo does and the balance between extra store/reload latency for x vs. more instructions to save/restore the caller's rbx.
It is possible for stack-unwind metadata to represent the changes to RSP here, just like if it had used sub rsp, 8 to spill/reload x into a stack slot. (But compilers don't know this optimization either, of using push to reserve space and initialize a variable. What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?. And doing that for more than one local var would lead to larger .eh_frame stack unwind metadata because you're moving the stack pointer separately with each push. That doesn't stop compilers from using push/pop to save/restore call-preserved regs, though.)
IDK if it would be worth teaching compilers to look for this optimization
It's maybe a good idea around a whole function, not across one call inside a function. And as I said, it's based on the pessimistic assumption that foo will save/restore RBX anyway. (Or optimizing for throughput if you know that latency from x to return value isn't important. But compilers don't know that and usually optimize for latency).
If you start making that pessimistic assumption in lots of code (like around single function calls inside functions), you'll start getting more cases where RBX isn't saved/restored and you could have taken advantage.
You also don't want this extra save/restore push/pop in a loop; just save/restore RBX outside the loop and use call-preserved registers in loops that make function calls. Even without loops, most functions make multiple function calls in the general case. This optimization idea could apply if you really don't use x between any of the calls (only before the first and after the last); otherwise you have the problem of maintaining 16-byte stack alignment for each call if you're doing one pop after a call, before another call.
Compilers are not great at tiny functions in general. But it's not great for CPUs either. Non-inline function calls have an impact on optimization at the best of times, unless compilers can see the internals of the callee and make more assumptions than usual. A non-inline function call is an implicit memory barrier: a caller has to assume that a function might read or write any globally-accessible data, so all such vars have to be in sync with the C abstract machine. (Escape analysis allows keeping locals in registers across calls if their address hasn't escaped the function.) Also, the compiler has to assume that the call-clobbered registers are all clobbered. This sucks for floating point in x86-64 System V, which has no call-preserved XMM registers.
Tiny functions like bar() are better off inlining into their callers. Compile with -flto so this can happen even across file boundaries in most cases. (Function pointers and shared-library boundaries can defeat this.)
I think one reason compilers haven't bothered to try to do these optimizations is that it would require a whole bunch of different code in the compiler internals, different from the normal stack vs. register-allocation code that knows how to save call-preserved registers and use them.
i.e. it would be a lot of work to implement, and a lot of code to maintain, and if it gets over-enthusiastic about doing this it could make worse code.
And also that it's (hopefully) not significant; if it matters, you should be inlining bar into its caller, or inlining foo into bar. This is fine unless there are a lot of different bar-like functions and foo is large, and for some reason they can't inline into their callers.
Why do compilers insist on using a callee-saved register here?
Because most compilers would generate nearly the same code for a given function, and they follow the global calling conventions defined by the ABI your compiler targets.
You could define your own different calling conventions (e.g. passing even more function arguments in processor registers, or on the contrary "packing" by bitwise operations two short arguments in a single processor register, etc...), and implement your compiler following them. You probably would need to recode some of the C standard library (e.g. patch lower parts of GNU libc then recompile it, if on Linux).
IIRC, some calling conventions are different on Windows and on FreeBSD and on Linux for the same CPU.
Notice that with a recent GCC (e.g. GCC 10, as of early 2021) you could compile and link with gcc -O3 -flto -fwhole-program and in some cases get some inline expansion. You can also build GCC from its source code as a cross-compiler, and since GCC is free software, you can improve it to follow your private new calling conventions. Be sure to document your calling conventions first.
If performance matters to you a lot, you can consider writing your own GCC plugin doing even more optimizations. Your compiler plugin could even implement other calling conventions (e.g. using asmjit).
Consider also improving TinyCC or Clang or NWCC to fit your needs.
My opinion is that in many cases it is not worth spending months of your effort to improve performance by just a few nanoseconds. But your employer/manager/client could disagree. Consider also compiling (or refactoring) significant parts of your software to silicon, e.g. through VHDL, or using specialized hardware, e.g. GPGPU with OpenCL or CUDA.

How does including assembly inline with C code work?

I've seen code for Arduino and other hardware that have assembly inline with C, something along the lines of:
asm("movl %ecx %eax"); /* moves the contents of ecx to eax */
__asm__("movb %bh (%eax)"); /*moves the byte from bh to the memory pointed by eax */
How does this actually Work? I realize every compiler is different, but what are the common reasons this is done, and how could someone take advantage of this?
The inline assembler code goes right into the complete assembled code untouched and in one piece. You do this when you really need absolutely full control over your instruction sequence, or maybe when you can't afford to let an optimizer have its way with your code. Maybe you need every clock tick. Maybe you need every single branch of your code to take the exact same number of clock ticks, and you pad with NOPs to make this happen.
In any case, there are lots of reasons why someone may want to do this, but you really need to know what you're doing. These chunks of code will be pretty opaque to your compiler, and it's likely you won't get any warnings if you're doing something bad.
Usually the compiler will just insert the assembler instructions right into its generated assembler output. And it will do this with no regard for the consequences.
For example, in this code the optimiser performs copy propagation: it sees that y=x, then z=y, so it replaces z=y with z=x, hoping that this will allow it to perform further optimisations. However, it doesn't spot that I've messed with the value of x in the meantime.
char x=6;
char y,z;
y=x; // y becomes 6
_asm
rrncf x, 1 // x becomes 3. Optimiser doesn't see this happen!
_endasm
z=y; // z should become 6, but actually gets
// the value of x, which is 3
To get around this, you can essentially tell the optimiser not to perform this optimisation for this variable.
volatile char x=6; // Tell the compiler that this variable could change
// all by itself, and any time, and therefore don't
// optimise with it.
char y,z;
y=x; // y becomes 6
_asm
rrncf x, 1 // x becomes 3. Optimiser doesn't see this happen!
_endasm
z=y; // z correctly gets the value of y, which is 6
Historically, C compilers generated assembly code, which would then be translated to machine code by an assembler. Inline assembly arose as a simple feature: at that point in the intermediate assembly code, inject some user-chosen code. Some compilers directly generate machine code, in which case they contain an assembler, or call an external assembler, to generate the machine code for the inline assembly snippets.
The most common use for assembly code is to use specialized processor instructions that the compiler isn't able to generate. For example, disabling interrupts for a critical section, controlling processor features (cache, MMU, MPU, power management, querying CPU capabilities, …), accessing coprocessors and hardware peripherals (e.g. inb/outb instructions on x86), etc. You'll rarely find asm("movl %ecx %eax"), because that affects general-purpose registers that the C code around it is also using, but something like asm("mcr p15, 0, 0, c7, c10, 5") has its use (data memory barrier on ARM). The OSDev wiki has several examples with code snippets.
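For instance, here is a hedged sketch (GNU C on x86 assumed) wrapping one such instruction, rdtsc, which reads the CPU's time-stamp counter and has no plain-C equivalent:

static inline unsigned long long rdtsc(void) {
    unsigned int lo, hi;
    /* rdtsc puts the low 32 bits of the counter in EAX and the high 32
       bits in EDX; the "=a" and "=d" constraints bind those registers
       to lo and hi. volatile stops the compiler from caching the result. */
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}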
Assembly code is also useful to implement features that break C's flow control model. A common example is context switching between threads (whether cooperative or preemptive, whether in the same address space or not) requiring assembly code to save and restore register values.
Assembly code is also useful to hand-optimize small bits of code for memory or speed. As compilers are getting smarter, this is rarely relevant at the application level nowadays, but it's still relevant in much of the embedded world.
There are two ways to combine assembly with C: with inline assembly, or by linking assembly modules with C modules. Linking is arguably cleaner but not always applicable: sometimes you need that one instruction in the middle of a function (e.g. for register saving on a context switch, a function call would clobber the registers), or you don't want to pay the cost of a function call.
Most C compilers support inline assembly, but the syntax varies. It is typically introduced by the keyword asm, _asm, __asm or __asm__. In addition to the assembly code itself, the inline assembly construct may contain additional code that allows you to pass values between assembly and C (for example, requesting that the value of a local variable is copied to a register on entry), or to declare that the assembly code clobbers or preserves certain registers.
asm("") and __asm__ are both valid usage. Basically, you can use __asm__ if the keyword asm conflicts with something in your program. If you have more than one instructions, you can write one per line in double quotes, and also suffix a ’\n’ and ’\t’ to the instruction. This is because gcc sends each instruction as a string to as(GAS) and by using the newline/tab you can send correctly formatted lines to the assembler. The code snippet in your question is basic inline.
In basic inline assembly, there is only instructions. In extended assembly, you can also specify the operands. It allows you to specify the input registers, output registers and a list of clobbered registers. It is not mandatory to specify the registers to use, you can leave that to GCC and that probably fits into GCC’s optimization scheme better. An example for the extended asm is:
__asm__ ("movl %eax, %ebx\n\t"
"movl $56, %esi\n\t"
"movl %ecx, $label(%edx,%ebx,$4)\n\t"
"movb %ah, (%ebx)");
Notice the '\n\t' at the end of each line except the last, and that each line is enclosed in quotes. This is because gcc sends each instruction to as as a string, as mentioned before; the newline/tab combination is required so that the lines are fed to as in the correct format.
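Since the block above only shows the template part, here is a hedged sketch of extended asm with actual operand constraints (a classic copy-and-increment example; src and dst are illustrative names):

int src = 1, dst;

__asm__("mov %1, %0\n\t"
        "add $1, %0"
        : "=r"(dst)    /* output operand: a register, written */
        : "r"(src));   /* input operand: a register, read */

/* dst is now src + 1; %0 and %1 in the template refer to the operands. */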

How do memory fences work?

I need to understand memory fences in multicore machines. Say I have this code
Core 1
mov [_x], 1; mov r1, [_y]
Core 2
mov [_y], 1; mov r2, [_x]
Now the unexpected result without memory fences is that both r1 and r2 can be 0 after execution. In my opinion, to counter that problem, we should put a memory fence in both codes, as putting it in only one would still not solve the problem. Something like the following...
Core 1
mov [_x], 1; memory_fence; mov r1, [_y]
Core 2
mov [_y], 1; memory_fence; mov r2, [_x]
Is my understanding correct or am I still missing something? Assume the architecture is x86. Also, can someone tell me how to put memory fences in a C++ code?
Fences serialize the operations that they fence (loads & stores); that is, no other operation may start till the fence executes, but the fence will not execute till all preceding operations have completed. Quoting Intel makes the meaning of this a little more precise (taken from the MFENCE instruction, page 3-628, Vol. 2A, Intel Instruction Reference):
This serializing operation guarantees that every load and store
instruction that precedes the MFENCE instruction in program order
becomes globally visible before any load or store instruction that
follows the MFENCE instruction.
A load instruction is considered to become globally visible when
the value to be loaded into its destination register is determined.
Using fences in C++ is tricky (C++11 may have fence semantics somewhere, maybe someone else has info on that), as it is platform and compiler dependent. For x86 using MSVC or ICC, you can use the _mm_lfence, _mm_sfence & _mm_mfence for load, store and load + store fencing (note that some of these are SSE2 instructions).
Note: this assumes an Intel perspective, that is: one using an x86 (32 or 64 bit) or IA64 processor
C++11 (ISO/IEC 14882:2011) defines a multi-threading-aware memory model.
Although I don't know of any compiler that currently implements the new memory model, C++ Concurrency in Action by Anthony Williams documents it very well. You may check Chapter 5 - The C++ Memory Model and Operations on Atomic Types where he explains about relaxed operations and memory fences. Also, he is the author of the just::thread library that may be used till we have compiler vendor support of the new standard.
(Its author, Anthony Williams, also maintains the boost::thread library.)
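(Since this answer was written, the C11/C++11 fences did materialize. A minimal sketch in C11 using <stdatomic.h>, mirroring the two-core example from the question; with the seq_cst fences in place, the outcome r1 == 0 && r2 == 0 is forbidden:)

#include <stdatomic.h>

atomic_int x = 0, y = 0;
int r1, r2;

void core1(void) {  /* run on one thread */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full fence; mfence on x86 */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
}

void core2(void) {  /* run on another thread */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
}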

Difference between n = 0 and n = n - n

When I read this question I remembered someone once telling me (many years ago) that from an assembler-point-of-view, these two operations are very different:
n = 0;
n = n - n;
Is this true, and if it is, why is it so?
EDIT: As pointed out by some replies, I guess this would be fairly easy for a compiler to optimize into the same thing. But what I find interesting is why they would differ if the compiler had a completely general approach.
When writing assembler code, you often used:
xor eax, eax
instead of
mov eax, 0
That is because the first statement involves only the opcode and no argument. Your CPU will do that in 1 cycle (instead of 2). I think your case is something similar (although using sub).
Compiler VC++ 6.0, without optimisations:
4: n = 0;
0040102F mov dword ptr [ebp-4],0
5:
6: n = n - n;
00401036 mov eax,dword ptr [ebp-4]
00401039 sub eax,dword ptr [ebp-4]
0040103C mov dword ptr [ebp-4],eax
In the early days, memory and CPU cycles were scarce. That led to a lot of so-called "peephole optimizations". Let's look at the code:
move.l #0,d0
moveq.l #0,d0
sub.l a0,a0
The first instruction would need two bytes for the op-code and then four bytes for the value (0). That meant four bytes wasted, plus you'd need to access the memory twice (once for the opcode and once for the data). Sloooow.
moveq.l was better since it merged the data into the op-code, but it only allowed small immediate values (a signed byte) to be written into a register. And you were limited to data registers only; there was no quick way to clear an address register. You'd have to clear a data register and then load the data register into an address register (two op-codes. Bad.).
Which led to the last operation, which works on any register, needs only two bytes, and costs a single memory read. Translated into C, you'd get
n = n - n;
which would work for the most frequently used types of n (integers or pointers).
An optimizing compiler will produce the same assembly code for the two.
It may depend on whether n is declared as volatile or not.
The assembly-language technique of zeroing a register by subtracting it from itself or XORing it with itself is an interesting one, but it doesn't really translate to C.
Any optimising C compiler will use this technique if it makes sense, and trying to write it out explicitly is unlikely to achieve anything.
In C they only differ (for integer types) if your compiler sucks (or you disabled optimization, like the MSVC answer shows).
Perhaps the person who told you this was trying to describe an asm instruction like sub reg,reg using C syntax, not talking about how such a statement would actually compile with a modern optimizing compiler? In which case I wouldn't say "very different" for most x86 CPUs; most do special-case sub same,same as a zeroing idiom, like xor same,same. What is the best way to set a register to zero in x86 assembly: xor, mov or and?
That makes an asm sub reg,reg similar to mov reg,0, with somewhat better code size. (But yes, some unique benefits wrt. partial-register renaming on Intel P6-family that you can only get from zeroing idioms, not mov).
They could differ in C if your compiler is trying to implement the mostly-deprecated memory_order_consume semantics from <stdatomic.h> on a weakly-ordered ISA like ARM or PowerPC, where n=0 breaks the dependency on the old value but n = n-n; still "carries a dependency", so a load like array[n] will be dependency-ordered after n = atomic_load_explicit(&shared_var, memory_order_consume). See Memory order consume usage in C11 for more details
In practice compilers gave up on trying to get that dependency-tracking right and promote consume loads to acquire. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0371r1.html and When should you not use [[carries_dependency]]?
But in asm for weakly-ordered ISAs, sub dst, same, same is required to still carry a dependency on the input register, just like in C. (Most weakly-ordered ISAs are RISCs with fixed-width instructions, so avoiding an immediate operand doesn't make the machine code any smaller. Thus there is no historical use of shorter zeroing idioms like sub r1, r1, r1, even on ISAs like ARM that don't have an architectural zero register. mov r1, #0 is the same size and at least as efficient as any other way. On MIPS you'd just use move $v0, $zero.)
So yes, for those non-x86 ISAs, they are very different in asm. n=0 avoids any false dependency on the old value of the variable (register), while n=n-n can't execute until the old value of n is ready.
Only x86 special-cases sub same,same and xor same,same as a dependency-breaking zeroing idiom like mov eax, imm32, because mov eax, 0 is 5 bytes but xor eax,eax is only 2. So there was a long history of using this peephole optimization before out-of-order execution CPUs, and such CPUs needed to run existing code efficiently. What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains the details.
Unless you're writing by hand in x86 asm, write 0 like a normal person instead of n-n or n^n, and let the compiler use xor-zeroing as a peephole optimization.
Asm for other ISAs might have other peepholes, e.g. another answer mentions m68k. But again, if you're writing in C this is the compiler's job. Write 0 when you mean 0. Trying to "hand hold" the compiler into using an asm peephole is very unlikely to work with optimization disabled, and with optimization enabled the compiler will efficiently zero a register if it needs to.
Not sure about assembly and such, but generally,
n = 0;
n = n - n;
aren't always equal if n is floating point; see here:
http://www.codinghorror.com/blog/archives/001266.html
Here are some corner cases where the behavior is different for n = 0 and n = n - n:
if n has a floating-point type, the result will differ from 0 for specific values: Infinity, -Infinity, NaN... (demonstrated in the sketch after this list);
if n is defined as volatile: the first expression will generate a single store into the corresponding memory location, while the second expression will generate two loads and a store; furthermore, if n is the location of a hardware register, the 2 loads might yield different values, causing the write to store a non-zero value;
if optimisations are disabled, the compiler might generate different code for these 2 expressions even for a plain int n, which might or might not execute at the same speed.
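A hedged sketch of the floating-point corner cases (C99; INFINITY and NAN come from <math.h>):

#include <stdio.h>
#include <math.h>

int main(void) {
    double n = INFINITY;
    printf("%f\n", n - n);   /* nan: Inf - Inf is NaN, not 0 */
    n = NAN;
    printf("%f\n", n - n);   /* nan: NaN propagates through subtraction */
    n = 1.5;
    printf("%f\n", n - n);   /* 0.000000: for ordinary finite values the
                                two forms agree */
    return 0;
}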

"register" keyword in C?

What does the register keyword do in C language? I have read that it is used for optimizing but is not clearly defined in any standard. Is it still relevant and if so, when would you use it?
It's a hint to the compiler that the variable will be heavily used and that you recommend it be kept in a processor register if possible.
Most modern compilers do that automatically, and are better at picking them than us humans.
I'm surprised that nobody has mentioned that you cannot take the address of a register variable, even if the compiler decides to keep the variable in memory rather than in a register.
So by using register you win nothing (the compiler will decide for itself where to put the variable anyway) and you lose the & operator; there is no reason to use it.
It tells the compiler to try to use a CPU register, instead of RAM, to store the variable. Registers are in the CPU and much faster to access than RAM. But it's only a suggestion to the compiler, and it may not follow through.
I know this question is about C, but the same question for C++ was closed as an exact duplicate of this question. This answer therefore may not apply for C.
The latest draft of the C++11 standard, N3485, says this in 7.1.1/3:
A register specifier is a hint to the implementation that the variable so declared will be heavily used. [ note: The hint can be ignored and in most implementations it will be ignored if the address of the variable is taken. This use is deprecated ... —end note ]
In C++ (but not in C), the standard does not state that you can't take the address of a variable declared register; however, because a variable stored in a CPU register throughout its lifetime does not have a memory location associated with it, attempting to take its address would be invalid, and the compiler will ignore the register keyword to allow taking the address.
I have read that it is used for optimizing but is not clearly defined in any standard.
In fact it is clearly defined by the C standard. Quoting the N1570 draft section 6.7.1 paragraph 6 (other versions have the same wording):
A declaration of an identifier for an object with storage-class
specifier register suggests that access to the object be as fast
as possible. The extent to which such suggestions are effective is
implementation-defined.
The unary & operator may not be applied to an object defined with register, and register may not be used in an external declaration.
There are a few other (fairly obscure) rules that are specific to register-qualified objects:
Defining an array object with register has undefined behavior.
Correction: It's legal to define an array object with register, but you can't do anything useful with such an object (indexing into an array requires taking the address of its initial element).
The _Alignas specifier (new in C11) may not be applied to such an object.
If the parameter name passed to the va_start macro is register-qualified, the behavior is undefined.
There may be a few others; download a draft of the standard and search for "register" if you're interested.
As the name implies, the original meaning of register was to require an object to be stored in a CPU register. But with improvements in optimizing compilers, this has become less useful. Modern versions of the C standard don't refer to CPU registers, because they no longer (need to) assume that there is such a thing (there are architectures that don't use registers). The common wisdom is that applying register to an object declaration is more likely to worsen the generated code, because it interferes with the compiler's own register allocation. There might still be a few cases where it's useful (say, if you really do know how often a variable will be accessed, and your knowledge is better than what a modern optimizing compiler can figure out).
The main tangible effect of register is that it prevents any attempt to take an object's address. This isn't particularly useful as an optimization hint, since it can be applied only to local variables, and an optimizing compiler can see for itself that such an object's address isn't taken.
It hasn't been relevant for at least 15 years as optimizers make better decisions about this than you can. Even when it was relevant, it made a lot more sense on a CPU architecture with a lot of registers, like SPARC or M68000 than it did on Intel with its paucity of registers, most of which are reserved by the compiler for its own purposes.
Actually, register tells the compiler that the variable does not alias with anything else in the program (not even chars).
That can be exploited by modern compilers in a variety of situations, and can help the compiler quite a bit in complex code - in simple code the compilers can figure this out on their own.
Otherwise, it serves no purpose and is not used for register allocation. It does not usually incur performance degradation to specify it, as long as your compiler is modern enough.
Storytime!
C, as a language, is an abstraction of a computer. It allows you to do things in terms of what a computer does: manipulate memory, do math, print things, etc.
But C is only an abstraction. And ultimately, what it's abstracting away from you is assembly language. Assembly is the language that a CPU reads, and if you use it, you do things in terms of the CPU. What does a CPU do? Basically, it reads from memory, does math, and writes to memory. The CPU doesn't just do math on numbers sitting in memory. First, you have to move a number from memory into storage inside the CPU called a register. Once you're done doing whatever you need to do to this number, you can move it back to normal system memory. Why use system memory at all? Because registers are limited in number. You only get about a hundred bytes' worth in modern processors, and older popular processors were even more fantastically limited (the 6502 had 3 8-bit registers for your free use). So, your average math operation looks like:
load first number from memory
load second number from memory
add the two
store answer into memory
A lot of that is... not math. Those load and store operations can take up to half your processing time. C, being an abstraction of computers, freed the programmer from the worry of using and juggling registers, and since the number and type vary between computers, C places the responsibility of register allocation solely on the compiler. With one exception.
When you declare a variable register, you are telling the compiler "Yo, I intend for this variable to be used a lot and/or be short lived. If I were you, I'd try to keep it in a register." When the C standard says compilers don't have to actually do anything, that's because the C standard doesn't know what computer you're compiling for, and it might be like the 6502 above, where all 3 registers are needed just to operate, and there's no spare register to keep your number. However, when it says you can't take the address, that's because registers don't have addresses. They're the processor's hands. Since the compiler doesn't have to give you an address, and since it can't have an address at all ever, several optimizations are now open to the compiler. It could, say, keep the number in a register always. It doesn't have to worry about where it's stored in computer memory (beyond needing to get it back again). It could even pun it into another variable, give it to another processor, give it a changing location, etc.
tl;dr: Short-lived variables that do lots of math. Don't declare too many at once.
You are messing with the compiler's sophisticated graph-coloring algorithm, which is used for register allocation. Well, mostly. It acts as a hint to the compiler; that's true. But it is not ignored in its entirety, since you are not allowed to take the address of a register variable (remember, the compiler, now at your mercy, will try to act differently). Which, in a way, is telling you not to use it.
The keyword was used long, long ago, when there were so few registers that you could count them all using your index finger.
But, as I said, deprecated doesn't mean you cannot use it.
Just a little demo (without any real-world purpose) for comparison: when the register keywords before each variable are removed, this piece of code takes 3.41 seconds on my i7 (GCC); with register, the same code completes in 0.7 seconds.
#include <stdio.h>
int main(int argc, char** argv) {
register int numIterations = 20000;
register int i=0;
unsigned long val=0;
for (i; i<numIterations+1; i++)
{
register int j=0;
for (j;j<i;j++)
{
val=j+i;
}
}
printf("%d", val);
return 0;
}
I have tested the register keyword under QNX 6.5.0 using the following code:
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <sys/neutrino.h>
#include <sys/syspage.h>
int main(int argc, char *argv[]) {
uint64_t cps, cycle1, cycle2, ncycles;
double sec;
register int a=0, b = 1, c = 3, i;
cycle1 = ClockCycles();
for(i = 0; i < 100000000; i++)
a = ((a + b + c) * c) / 2;
cycle2 = ClockCycles();
ncycles = cycle2 - cycle1;
printf("%lld cycles elapsed\n", ncycles);
cps = SYSPAGE_ENTRY(qtime) -> cycles_per_sec;
printf("This system has %lld cycles per second\n", cps);
sec = (double)ncycles/cps;
printf("The cycles in seconds is %f\n", sec);
return EXIT_SUCCESS;
}
I got the following results:
-> 807679611 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.244600
And now without register int:
int a=0, b = 1, c = 3, i;
I got:
-> 1421694077 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.430700
During the seventies, at the very beginning of the C language, the register keyword was introduced in order to allow the programmer to give hints to the compiler, telling it that the variable would be used very often and that it would be wise to keep its value in one of the processor's internal registers.
Nowadays, optimizers are much more efficient than programmers to determine variables that are more likely to be kept into registers, and the optimizer does not always take the programmer’s hint into account.
So many people wrongly recommend not to use the register keyword.
Let’s see why!
The register keyword has an associated side effect: you cannot reference (take the address of) a register variable.
People advising others not to use register wrongly take this as an additional argument against it.
However, the simple fact of knowing that you cannot take the address of a register variable allows the compiler (and its optimizer) to know that the value of this variable cannot be modified indirectly through a pointer.
When, at a certain point in the instruction stream, a register variable has its value assigned in a processor register, and that register has not since been used to hold the value of another variable, the compiler knows that it does not need to reload the variable's value into the register.
This avoids expensive, useless memory accesses.
Do your own tests and you will get significant performance improvements in your most inner loops.
register would notify the compiler that the coder believed this variable would be written/read enough to justify its storage in one of the few registers available for variable use. Reading/writing registers is usually faster and can require shorter opcodes.
Nowadays, this isn't very useful, as most compilers' optimizers are better than you at determining whether a register should be used for that variable, and for how long.
gcc 9.3 asm output, without using optimisation flags (everything in this answer refers to standard compilation without optimisation flags):
#include <stdio.h>
int main(void) {
int i = 3;
i++;
printf("%d", i);
return 0;
}
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
add DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
#include <stdio.h>
int main(void) {
register int i = 3;
i++;
printf("%d", i);
return 0;
}
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
push rbx
sub rsp, 8
mov ebx, 3
add ebx, 1
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
add rsp, 8
pop rbx
pop rbp
ret
This forces ebx to be used for the calculation, meaning it needs to be pushed to the stack and restored at the end of the function because it is callee-saved. register produces more lines of code and 1 memory write and 1 memory read (although realistically, this could have been optimised to 0 reads/writes if the calculation had been done in esi, which is what happens when using C++'s const register). Not using register causes 2 writes and 1 read (although store-to-load forwarding will occur on the read). This is because the value has to be present and updated directly on the stack so that the correct value can be read by address (pointer); register doesn't have this requirement and cannot be pointed to. const and register are basically the opposite of volatile: using volatile overrides the const optimisations at file and block scope and the register optimisations at block scope. const register and register will produce identical outputs because const does nothing at block scope in C, so only the register optimisations apply.
On clang, register is ignored but const optimisations still occur.
On supported C compilers, it tries to optimize the code so that the variable's value is held in an actual processor register.
Microsoft's Visual C++ compiler ignores the register keyword when global register-allocation optimization (the /Oe compiler flag) is enabled.
See register Keyword on MSDN.
The register keyword tells the compiler to store the particular variable in a CPU register so that it can be accessed quickly. From a programmer's point of view, the register keyword is used for variables which are heavily used in a program, so that the compiler can speed up the code. It is still up to the compiler whether to keep the variable in a CPU register or in main memory.
register indicates to the compiler that it should optimize this code by storing that particular variable in a register rather than in memory. It is a request to the compiler; the compiler may or may not honor it.
You can use this facility in cases where some of your variables are accessed very frequently.
For example: a loop counter.
One more thing: if you declare a variable as register, then you can't take its address, as it is not (notionally) stored in memory; it gets allocated in a CPU register.
The register keyword is a request to the compiler that the specified variable be stored in a processor register instead of memory, as a way to gain speed, mostly because it will be heavily used. The compiler may ignore the request.
