Segmentation fault when attempting to print int value from x86 external function [duplicate] - c

I've noticed that a lot of calling conventions insist that [e]bx be preserved by the callee.
Now, I can understand why they'd preserve something like [e]sp or [e]bp, since clobbering those corrupts the caller's stack. I can also understand why you might want to preserve [e]si or [e]di, since clobbering them can break the caller's string operations if the callee isn't particularly careful.
But [e]bx? What on earth is so important about [e]bx? What makes [e]bx so special that multiple calling conventions insist that it be preserved throughout function calls?
Is there some sort of subtle bug/gotcha that can arise from messing with [e]bx?
Does modifying [e]bx somehow have a greater impact on the caller than modifying [e]dx or [e]cx, for instance?
I just don't understand why so many calling conventions single out [e]bx for preservation.

Not all registers make good candidates for preserving ("no" = a poor candidate for preservation, "must" = has to be preserved):
(e)ax -- no   -- implicitly used in some instructions; return value
(e)dx -- no   -- edx:eax is implicitly used in cdq, div, mul and in return values
(e)bx --      -- generic register, usable in 16-bit addressing modes (base)
(e)cx --      -- shift counts, used in loop and rep
(e)si --      -- movs operations, usable in 16-bit addressing modes (index)
(e)di --      -- movs operations, usable in 16-bit addressing modes (index)
(e)bp -- must -- frame pointer, usable in 16-bit addressing modes (base)
(e)sp -- must -- stack pointer, not addressable in 8086 (other than push/pop)
Looking at the table, two registers have a good reason to be preserved and two have a reason not to be. The accumulator (e)ax, for example, is the most often used register because of its short encodings. SI and DI make a logical register pair -- REP MOVS and the other string operations trash both of them.
In a half-and-half callee/caller-saved split, the discussion therefore basically comes down to whether bx/cx or si/di gets preserved. In the other common calling conventions it is just EAX, ECX and EDX that can be trashed.
EBX does have a few obscure implicit uses that are still relevant in modern code (e.g. CMPXCHG8B / CMPXCHG16B), but it's the least special register in 32/64-bit code.
EBX makes a good choice for a call-preserved register because it's rare for a function to need EBX specifically, rather than just any non-volatile register, so it seldom has to be saved/restored for its own sake. As Brett Hale's answer points out, that makes EBX a great choice for the global offset table (GOT) pointer in ABIs that need one.
In 16-bit mode, addressing modes were limited to (any subset of) [BP|BX + DI|SI + disp8/disp16], so BX is definitely special there.
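To make that concrete, a few examples of what the 16-bit encodings do and do not allow (an illustrative sketch):
mov ax, [bx]            ; valid: BX as base
mov ax, [bp+si]         ; valid: BP base + SI index
mov ax, [bx+di+10h]     ; valid: base + index + displacement
; mov ax, [cx]          ; invalid: CX cannot be a base or an index
; mov ax, [sp+10h]      ; invalid: SP has no addressing-mode encoding on the 8086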

This is a compromise between saving none of the registers and saving them all. Either extreme could have been chosen, but both lead to inefficiency from copying register contents to memory (the stack). Requiring some registers to be preserved while allowing others to be clobbered reduces the average cost of a function call.

One of the main reasons, certainly for the i386 ELF ABI, is that ebx holds the address of the global offset table (GOT) for position-independent code (PIC). See page 3-35 of the specification for the details. It would be disruptive in the extreme if, say, shared library code had to restore the GOT pointer after every function call.
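For illustration, here is a sketch of how i386 PIC code typically establishes the GOT pointer in ebx near the top of a function; the label get_pc and the symbols GOT_OFFSET and my_global_slot are illustrative, and the exact relocation syntax depends on the assembler:
    call  get_pc                      ; pushes the address of the next instruction
get_pc:
    pop   ebx                         ; ebx = address of get_pc
    add   ebx, GOT_OFFSET             ; GOT_OFFSET: link-time constant from get_pc to the GOT
    ; from here on, globals and external symbols are reached through the GOT:
    mov   eax, [ebx + my_global_slot] ; load the address of my_global from its GOT slot
    mov   eax, [eax]                  ; then load the value itself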

x86_64 : is stack frame pointer almost useless?

Linux x86_64.
gcc 5.x
I was studying the output of the same code compiled with and without -fomit-frame-pointer (gcc at "-O3" enables that option by default). With the frame pointer, every function carries the usual
pushq %rbp
movq %rsp, %rbp
...
popq %rbp
My question is:
If I globally omit the frame pointer, even, at the extreme, when compiling an operating system, is there a catch?
I know that interrupts use that information, so is omitting the frame pointer good only for user space?
The compilers always generate self-consistent code, so disabling the frame pointer is fine as long as you don't use external/hand-crafted code that makes some assumption about it (e.g. by relying on the value of rbp).
The interrupts don't use the frame pointer information; they may use the current stack pointer for saving a minimal context, but this depends on the type of interrupt and on the OS (a hardware interrupt probably uses a Ring 0 stack).
You can look at the Intel manuals for more information on this.
About the usefulness of the frame pointer:
Years ago, after compiling a couple of simple routines and looking at the generated 64-bit assembly code, I had the same question.
If you don't mind reading a whole lot of notes I wrote for myself back then, here they are.
Note: asking about the usefulness of something is a little bit relative. Writing assembly code for the current main 64-bit ABIs, I found myself using the stack frame less and less. However, this is just my coding style and opinion.
I like using the frame pointer, writing the prologue and epilogue of a function, but I like direct uncomfortable answers too, so here's how I see it:
Yes, the frame pointer is almost useless in x86_64.
Beware it is not completely useless, especially for humans, but a compiler doesn't need it anymore.
To better understand why we have a frame pointer in the first place it is better to recall some history.
Back in the real mode (16 bit) days
When Intel CPUs supported only "16-bit mode" there were some limitations on how to access the stack; in particular, this instruction was (and still is) illegal:
mov ax, WORD [sp+10h]
because sp cannot be used as a base register. Only a few designated registers could be used for that purpose, for example bx or the more famous bp.
It's not a detail everybody pays attention to nowadays, but bp has an advantage over the other base registers: by default it implies the use of ss as the segment/selector register, just like the implicit uses of sp (by push, pop, etc.), and like esp does on later 32-bit processors.
Even if your program was scattered all across memory, with each segment register pointing to a different area, bp and sp acted the same; after all, that was the designers' intent.
So a stack frame was usually necessary and consequently a frame pointer.
bp effectively partitioned the stack into four parts: the arguments area, the return address, the old bp (just a WORD) and the local variables area, each area identified by the sign of the offset used to access it: positive for the arguments and return address, zero for the old bp, negative for the local variables.
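For example, inside a 16-bit function that has executed the standard prologue (push bp / mov bp, sp), the four areas are reached like this (an illustrative sketch assuming a near call and word-sized arguments; note that bp-relative accesses default to the ss segment):
mov ax, [bp+6]   ; second argument        (positive offsets: caller's arguments)
mov cx, [bp+4]   ; first argument         ([bp+2] holds the return address)
mov dx, [bp]     ; the saved (old) bp
mov si, [bp-2]   ; first local variable   (negative offsets: locals)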
Extended effective addresses
As the Intel CPUs evolved, the more extensive 32-bit addressing modes were added: specifically, the possibility to use any 32-bit general-purpose register as a base register, including esp.
With instructions like this
mov eax, DWORD [esp+10h]
now valid, the stack frame and the frame pointer seemed doomed to extinction.
Yet this was not the case, at least in the beginning.
It is true that esp alone could now do the job, but the separation of the stack into the four areas mentioned above is still useful, especially for humans.
Without the frame pointer, a push or a pop changes the offsets of the arguments and local variables relative to esp, producing code that looks unintuitive at first sight. Consider how to implement the following C routine with the cdecl calling convention:
int my_routine(int a, int b)
{
    return my_add(a, b);
}
without and with a stack frame:
my_routine:
    push DWORD [esp+08h]
    push DWORD [esp+08h]
    call my_add
    add esp, 08h        ; remove the pushed arguments (my_add is cdecl)
    ret
my_routine:
    push ebp
    mov ebp, esp
    push DWORD [ebp+0Ch]
    push DWORD [ebp+08h]
    call my_add
    add esp, 08h        ; remove the pushed arguments
    pop ebp
    ret
At first sight it seems that the first version pushes the same value twice. It actually pushes the two separate arguments, however: the first push lowers esp, so the same effective-address expression refers to a different argument for the second push.
If you add local variables (especially lots of them) then the situation quickly becomes hard to read: does mov eax, [esp+0CAh] refer to a local variable or to an argument? With a stack frame we have fixed offsets for the arguments and the local variables.
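The shifting offsets look like this in practice (a small illustrative sketch, not compiler output):
; inside my_routine, no frame pointer: on entry [esp] is the return address,
; so argument 'a' is at [esp+4]
mov  eax, [esp+4]    ; a
push eax             ; esp just moved down by 4...
mov  eax, [esp+8]    ; ...so the same argument 'a' is now at [esp+8]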
At first even the compilers still preferred the fixed offsets given by the frame base pointer; I saw this behavior change first with gcc.
In a debug build the stack frame effectively adds clarity to the code and makes it easy for the (proficient) programmer to follow what is going on and, as pointed out in the comment, lets them recover the stack frame more easily.
The modern compilers however are good at math and can easily keep count of the stack pointer movements and generate the appropriate offsets from esp, omitting the stack frame for faster execution.
When a CISC requires data alignment
Until the introduction of SSE instructions the Intel processors never asked much from the programmers compared to their RISC brothers.
In particular they never asked for data alignment: we could access 32-bit data at an address that is not a multiple of 4 with no major complaint (depending on the DRAM data width, this may result in increased latency).
SSE introduced 16-byte operands that have to be accessed on 16-byte boundaries; as the SIMD paradigm got implemented efficiently in hardware and became more popular, 16-byte alignment became important.
The main 64-bit ABIs now require it: the stack must be aligned on paragraphs (i.e., 16 bytes).
Now, we are usually called such that after the prologue the stack is aligned, but suppose we are not blessed with that guarantee; we would then need to do one of these:
; align, then reserve        ; reserve, then align
push rbp                     push rbp
mov rbp, rsp                 mov rbp, rsp
and spl, 0f0h                sub rsp, xxx
sub rsp, 10h*k               and spl, 0f0h
One way or another the stack is aligned after these prologues; however, we can no longer use a negative offset from rbp to access local variables that need alignment, because the frame pointer itself is not aligned.
We need to use rsp instead. We could arrange a prologue that leaves rbp pointing at the top of an aligned area of local variables, but then the arguments would be at unknown offsets.
We can arrange a complex stack frame (maybe with more than one pointer) but the key of the old fashioned frame base pointer was its simplicity.
So we can use the frame pointer to access the arguments on the stack and the stack pointer for the local variables -- fair enough.
Alas, the role of the stack in argument passing has been reduced; for a small number of arguments (four on Windows x64, six integer arguments in the System V ABI) it is not used at all, and in the future it will probably be used even less.
So we don't use the frame pointer for local variables (mostly), nor for the arguments (mostly), for what do we use it?
It saves a copy of the original rsp, so restoring the stack pointer at function exit takes a single mov (see the epilogue sketch after these points). If the stack was aligned with an and, which is not invertible, the original copy is necessary.
Actually some ABIs guarantee that after the standard prologue the stack is aligned thereby allowing us to use the frame pointer as usual.
Some variables don't need alignment and can be accessed with an unaligned frame pointer, this is usually true for hand crafted code.
Some functions require more than four parameters.
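Coming back to the first of those points: with a frame pointer the epilogue does not need to know how much the prologue subtracted from rsp or what it masked off, which is why a single mov suffices (a minimal sketch):
mov rsp, rbp     ; undoes any 'sub rsp, ...' and 'and spl, ...' in one step
pop rbp
ret
; equivalently: leave / ret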
Summary
The frame pointer is a vestigial paradigm from 16 bit programs that has proven itself still useful on 32 bit machines because of its simplicity and clarity when accessing local variables and arguments.
On 64-bit machines, however, the strict alignment requirements take away most of that simplicity and clarity; the frame pointer nevertheless remains in use in debug builds.
On the fact that the frame pointer can be used to make fun things: it is true, I guess; I've never seen such code, but I can imagine how it would work.
I, however, focused on the housekeeping role of the frame pointer as this is the way I always have seen it.
All the crazy things can be done with any pointer set to the same value of the frame pointer, I give the latter a more "special" role.
VS2013 for example sometimes uses rdi as a "frame pointer", but I don't consider it a real frame pointer if it doesn't use rbp/ebp/bp.
To me the use of rdi means a Frame Pointer Omission optimization :)

Saving registers state in COM program

I disassembled a simple DOS .COM program and there was some code which saves and restores register values:
PUSH AX ; this is the first instruction
PUSH CX
....
POP CX
POP AX
MOV AX, 0x4C00 ; AH = 0x4C (terminate), AL = 0x00 (return code)
INT 0x21       ; call DOS interrupt 21h => END
This is very similar to function prologue and epilogue in C programs. But prologues are added automatically by compiler, and the program above was written manually in assembler, so the programmer took full responsibility for saving and restoring values in this code.
My question is: what will happen if I unintentionally forget to save some registers in my program?
And what if I intentionally replace these instructions with NOPs in a hex editor? Will this lead to a program crash? And why is the called function responsible for saving the outer context on the stack? From my point of view this should somehow be done in the calling function, to prevent problems if I use 3rd-party libraries with poorly written code that might break my program's execution.
One problem of making the calling function save all of its working registers before calling another function is that sometimes a function is interrupted (i.e. a hardware interrupt) without its knowledge. In DOS, for example, there was that pesky 54 millisecond timer tick. 18 times per second, a hardware interrupt would transfer control from whatever code was executing to the timer tick handler. This happened automatically unless your program specifically disabled interrupts.
The timer tick handler would then save all of the registers it was going to use, do its work, and then restore the registers it saved before returning.
Sure, you could say that interrupt handlers are special, but why? Even with the paucity of registers on the 8086 (AX, BX, CX, DX, SI, DI, Flags -- did I forget anything? I purposely didn't include the segment registers), making a function save its entire state before transferring control means that you'd be using a lot of unnecessary stack space and execution cycles to save things because they might be modified. But if the called function is responsible for saving just the registers it uses, and it only uses AX and CX, then it can save just those two registers. It makes for smaller and faster code, and much less stack space usage.
When you start talking about call hierarchies that are many levels deep, the difference between pushing 8 registers rather than 2 registers adds up pretty quickly.
Consider x86-64, with its 16 general-purpose registers. Do you really think a function should be forced to save all 16 of them before calling another function, even when the called function only uses two of them? Saving 16 64-bit registers requires 128 bytes of stack space, as opposed to the 16 bytes needed to save two registers.
The primary point of writing things in assembly language these days is to write faster and smaller code than what a compiler can write. A guiding principle is don't do more work than you have to. That means it's up to you to know what registers your assembly language function is using, and to save those registers on entry and restore them on exit.
If you don't want the burden of remembering what to push and pop, I would advise sticking to a higher-level language.
In assembler, if the function is your own then you should save and restore all registers you use within the function except those which return an output from the function. If others wrote the function, look up its documentation. If in doubt, save/restore registers before/after calling the function (except those which are supposed to return a value).
Since the DOS Terminate function does not rely on any register settings (other than AX) for its operation (*), both pushes/pops in the code you have posted seem superfluous. You should however be aware that the programmer could have pushed these values for the purpose of using them locally! So replacing both pushes by NOPs in a hex editor is surely a bad idea. You could, however, replace both pops by NOPs, because at that point in the program restoring AX/CX and balancing the stack are unnecessary because of (*).
Since your question is about saving registers at the program level, the answer must be that pushing/popping registers purely for the sake of saving them is useless there. Nothing bad will happen if you unintentionally forget to save some registers at the top level of your program.

Cost of push vs. mov (stack vs. near memory), and the overhead of function calls

Question:
Is accessing the stack the same speed as accessing memory?
For example, I could choose to do some work within the stack, or I could do work directly with a labelled location in memory.
So, specifically: is push ax the same speed as mov [bx], ax? Likewise is pop ax the same speed as mov ax, [bx]? (assume bx holds a location in near memory.)
Motivation for Question:
It is common in C to discourage trivial functions that take parameters.
I've always thought that is because not only must the parameters get pushed onto the stack and then popped off the stack once the function returns, but also because the function call itself must preserve the CPU's context, which means more stack usage.
But assuming one knows the answer to the headlined question, it should be possible to quantify the overhead that the function uses to set itself up (push / pop / preserve context, etc.) in terms of an equivalent number of direct memory accesses. Hence the headlined question.
(Edit: Clarification: near used above is as opposed to far in the segmented memory model of 16-bit x86 architecture.)
Nowadays your C compiler can outsmart you. It may inline simple functions, and if it does, there will be no function call or return and, perhaps, no additional stack manipulation related to passing and accessing formal function parameters (or their equivalent when the function is inlined but the available registers are exhausted), provided everything can be done in registers or, better yet, the result is a constant value the compiler can see and take advantage of.
Function calls themselves can be relatively cheap (but not necessarily zero-cost) on modern CPUs, if they're repeated and if there's a separate instruction cache and various predicting mechanisms, helping with efficient code execution.
Other than that, I'd expect the performance implications of the choice "local var vs global var" to depend on the memory usage patterns. If there's a memory cache in the CPU, the stack is likely to be in that cache, unless you allocate and deallocate large arrays or structures on it or have deep function calls or deep recursion, causing cache misses. If the global variable of interest is accessed often or if its neighbors are accessed often, I'd expect that variable to be in the cache most of the time as well. Again, if you're accessing large spans of memory that can't fit into the cache, you'll have cache misses and possibly reduced performance (possibly because there may or may not be a better, cache-friendly way of doing what you want to do).
If the hardware is pretty dumb (no or small caches, no prediction, no instruction reordering, no speculative execution, nothing), you clearly want to reduce the memory pressure and the number of function calls, because each and every one will count.
Yet another factor is instruction length and decoding. Instructions to access an on-stack location (relative to the stack pointer) can be shorter than instructions to access an arbitrary memory location at a given address. Shorter instructions may be decoded and executed faster.
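As a rough illustration (32-bit code; the byte counts come from the x86 opcode map, the operands are arbitrary):
89 45 08            mov [ebp+8], eax        ; 3 bytes: base register + 8-bit displacement
89 44 24 08         mov [esp+8], eax        ; 4 bytes: esp as a base needs an extra SIB byte
89 05 44 33 22 11   mov [11223344h], eax    ; 6 bytes: full 32-bit absolute address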
I'd say there's no definitive answer for all cases because performance depends on:
your hardware
your compiler
your program and its memory accessing patterns
For the clock-cycle-curious...
For those who would like to see specific clock cycles, instruction / latency tables for a variety of modern x86 and x86-64 CPUs are available here (thanks to hirschhornsalz for pointing these out).
You then get, on a Pentium 4 chip:
push ax and mov [bx], ax are virtually identical in their efficiency, with identical latencies and throughputs.
pop ax and mov ax, [bx] are similarly efficient, with identical throughputs, despite mov ax, [bx] having twice the latency of pop ax.
As far as the follow-on question in the comments (3rd comment):
indirect addressing (i.e. mov [bx], ax) is not materially different from direct addressing (i.e. mov [loc], ax), where loc is a symbol defined as a constant address, e.g. loc equ 0xfffd.
Conclusion: Combine this with Alexey's thorough answer, and there's a pretty solid case for the efficiency of using the stack and letting the compiler decide when a function should be inlined.
(Side note: In fact, even as far back as the 8086 from 1978, using the stack was still not less efficient than corresponding mov's to memory as can be seen from these old 8086 instruction timing tables.)
Understanding Latency & Throughput
A bit more may be needed to understand timing tables for modern CPUs. These should help:
definitions of latency and throughput
a useful analogy for latency and throughput, and their relation to instruction processing pipelines

Recognizing stack frames in a stack using saved EBP values

I would like to divide a stack into stack frames by looking at the raw data on the stack. I thought to do so by finding the "linked list" of saved EBP pointers.
Can I assume that a (standard and commonly used) C compiler (e.g. gcc) will always save and update EBP in the function prologue on every function call?
pushl %ebp
movl %esp, %ebp
Or are there cases where some compilers might skip that part for functions that don't get any parameters and don't have local variables?
The x86 calling conventions and the Wiki article on function prologue don't help much with that.
Is there any better method to divide a stack into stack frames just by looking at its raw data?
Thanks!
Some versions of gcc have a -fomit-frame-pointer optimization option. If memory serves, it can be used even with parameters/local variables (they index directly off of ESP instead of using EBP). Unless I'm badly mistaken, MS VC++ can do roughly the same.
Offhand, I'm not sure of a way that's anywhere close to universally applicable. If you have code with debug info, it's usually pretty easy -- otherwise though...
Even with the framepointer optimized out, stackframes are often distinguishable by looking through stack memory for saved return addresses instead. Remember that a function call sequence in x86 always consists of:
call someFunc ; pushes return address (instr. following `call`)
...
someFunc:
push EBP ; if framepointer is used
mov EBP, ESP ; if framepointer is used
push <nonvolatile regs>
...
so your stack will always - even if the framepointers are missing - have return addresses in there.
How do you recognize a return address?
To start with, on x86 instructions have different lengths. That means return addresses -- unlike other pointers (!) -- tend to be misaligned values. Statistically, about 3/4 of them do not end at a multiple of four.
Any misaligned pointer is a good candidate for a return address.
Then, remember that call instructions on x86 have specific opcode formats; read a few bytes before the candidate return address and check if you find a call opcode there (most of the time it's five bytes back for a direct call, two bytes back for a call through a register, and a few more for other indirect forms). If so, you've found a return address.
This is also a way to distinguish C++ vtables from return addresses by the way - vtable entrypoints you'll find on the stack, but looking "back" from those addresses you don't find call instructions.
With that method, you can get candidates for the call sequence out of the stack even without having symbols, framesize debugging information or anything.
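For reference, these are the kinds of encodings the "look a few bytes back for a call opcode" check is matching against (byte counts from the x86 opcode map; the particular operands are just examples):
E8 xx xx xx xx   call rel32       ; direct near call -- the return address is 5 bytes past the E8
FF D3            call ebx         ; indirect call through a register -- 2 bytes
FF 55 F8         call [ebp-8]     ; indirect call through memory -- 3 bytes here, more with a 32-bit displacement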
The details of how to piece the actual call sequence together from those candidates are less straightforward though, you need a disassembler and some heuristics to trace potential call flows from the lowest-found return address all the way up to the last known program location. Maybe one day I'll blog about it ;-) though at this point I'd rather say that the margin of a stackoverflow posting is too small to contain this ...

"register" keyword in C?

What does the register keyword do in C language? I have read that it is used for optimizing but is not clearly defined in any standard. Is it still relevant and if so, when would you use it?
It's a hint to the compiler that the variable will be heavily used and that you recommend it be kept in a processor register if possible.
Most modern compilers do that automatically, and are better at picking them than us humans.
I'm surprised that nobody mentioned that you cannot take the address of a register variable, even if the compiler decides to keep the variable in memory rather than in a register.
So with register you win nothing (the compiler will decide for itself where to put the variable anyway) and you lose the & operator -- no reason to use it.
It tells the compiler to try to use a CPU register, instead of RAM, to store the variable. Registers are in the CPU and much faster to access than RAM. But it's only a suggestion to the compiler, and it may not follow through.
I know this question is about C, but the same question for C++ was closed as a exact duplicate of this question. This answer therefore may not apply for C.
The latest draft of the C++11 standard, N3485, says this in 7.1.1/3:
A register specifier is a hint to the implementation that the variable so declared will be heavily used. [ note: The hint can be ignored and in most implementations it will be ignored if the address of the variable is taken. This use is deprecated ... —end note ]
In C++ (but not in C), the standard does not state that you can't take the address of a variable declared register; however, because a variable stored in a CPU register throughout its lifetime does not have a memory location associated with it, attempting to take its address would be invalid, and the compiler will ignore the register keyword to allow taking the address.
I have read that it is used for optimizing but is not clearly defined in any standard.
In fact it is clearly defined by the C standard. Quoting the N1570 draft section 6.7.1 paragraph 6 (other versions have the same wording):
A declaration of an identifier for an object with storage-class
specifier register suggests that access to the object be as fast
as possible. The extent to which such suggestions are effective is
implementation-defined.
The unary & operator may not be applied to an object defined with register, and register may not be used in an external declaration.
There are a few other (fairly obscure) rules that are specific to register-qualified objects:
Defining an array object with register has undefined behavior.
Correction: It's legal to define an array object with register, but you can't do anything useful with such an object (indexing into an array requires taking the address of its initial element).
The _Alignas specifier (new in C11) may not be applied to such an object.
If the parameter name passed to the va_start macro is register-qualified, the behavior is undefined.
There may be a few others; download a draft of the standard and search for "register" if you're interested.
As the name implies, the original meaning of register was to require an object to be stored in a CPU register. But with improvements in optimizing compilers, this has become less useful. Modern versions of the C standard don't refer to CPU registers, because they no longer (need to) assume that there is such a thing (there are architectures that don't use registers). The common wisdom is that applying register to an object declaration is more likely to worsen the generated code, because it interferes with the compiler's own register allocation. There might still be a few cases where it's useful (say, if you really do know how often a variable will be accessed, and your knowledge is better than what a modern optimizing compiler can figure out).
The main tangible effect of register is that it prevents any attempt to take an object's address. This isn't particularly useful as an optimization hint, since it can be applied only to local variables, and an optimizing compiler can see for itself that such an object's address isn't taken.
It hasn't been relevant for at least 15 years as optimizers make better decisions about this than you can. Even when it was relevant, it made a lot more sense on a CPU architecture with a lot of registers, like SPARC or M68000 than it did on Intel with its paucity of registers, most of which are reserved by the compiler for its own purposes.
Actually, register tells the compiler that the variable cannot alias anything else in the program (not even through a char *), because its address can never be taken.
That can be exploited by modern compilers in a variety of situations, and can help the compiler quite a bit in complex code - in simple code the compilers can figure this out on their own.
Otherwise, it serves no purpose and is not used for register allocation. It does not usually incur performance degradation to specify it, as long as your compiler is modern enough.
Storytime!
C, as a language, is an abstraction of a computer. It allows you to do things, in terms of what a computer does, that is manipulate memory, do math, print things, etc.
But C is only an abstraction, and ultimately what it is abstracting away from you is assembly language. Assembly is the language that a CPU reads, and if you use it, you do things in terms of the CPU. What does a CPU do? Basically, it reads from memory, does math, and writes to memory. The CPU doesn't just do math on numbers in memory: first, you have to move a number from memory into a small storage location inside the CPU called a register. Once you're done doing whatever you need to do to this number, you can move it back to normal system memory. Why use system memory at all? Registers are limited in number. You only get about a hundred bytes' worth in modern processors, and older popular processors were even more fantastically limited (the 6502 had 3 8-bit registers for your free use). So, your average math operation looks like:
load first number from memory
load second number from memory
add the two
store answer into memory
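Spelled out in x86 (an illustrative sketch; first, second and answer are just labels for memory locations):
mov  eax, [first]     ; load the first number from memory
mov  ebx, [second]    ; load the second number from memory
add  eax, ebx         ; add the two
mov  [answer], eax    ; store the answer into memory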
A lot of that is... not math. Those load and store operations can take up to half your processing time. C, being an abstraction of computers, freed the programmer from the worry of using and juggling registers, and since their number and type vary between computers, C places the responsibility of register allocation solely on the compiler. With one exception.
When you declare a variable register, you are telling the compiler "Yo, I intend for this variable to be used a lot and/or be short lived. If I were you, I'd try to keep it in a register." When the C standard says compilers don't have to actually do anything, that's because the C standard doesn't know what computer you're compiling for, and it might be like the 6502 above, where all 3 registers are needed just to operate, and there's no spare register to keep your number. However, when it says you can't take the address, that's because registers don't have addresses. They're the processor's hands. Since the compiler doesn't have to give you an address, and since it can't have an address at all ever, several optimizations are now open to the compiler. It could, say, keep the number in a register always. It doesn't have to worry about where it's stored in computer memory (beyond needing to get it back again). It could even pun it into another variable, give it to another processor, give it a changing location, etc.
tl;dr: Short-lived variables that do lots of math. Don't declare too many at once.
You are messing with the compiler's sophisticated graph-coloring algorithm, which is used for register allocation. Well, mostly. It acts as a hint to the compiler -- that's true. But it is not ignored in its entirety, since you are not allowed to take the address of a register variable (remember, the compiler, now at your mercy, will try to act differently). Which in a way is telling you not to use it.
The keyword was used long, long ago, when there were so few registers that you could count them all on your fingers.
But, as I said, deprecated doesn't mean you cannot use it.
Just a little demo (without any real-world purpose) for comparison: when the register keywords before each variable are removed, this piece of code takes 3.41 seconds on my i7 (GCC); with register, the same code completes in 0.7 seconds.
#include <stdio.h>

int main(int argc, char** argv) {
    register int numIterations = 20000;
    register int i = 0;
    unsigned long val = 0;
    for (i = 0; i < numIterations + 1; i++)
    {
        register int j = 0;
        for (j = 0; j < i; j++)
        {
            val = j + i;
        }
    }
    printf("%lu", val);
    return 0;
}
I have tested the register keyword under QNX 6.5.0 using the following code:
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <sys/neutrino.h>
#include <sys/syspage.h>
int main(int argc, char *argv[]) {
    uint64_t cps, cycle1, cycle2, ncycles;
    double sec;
    register int a = 0, b = 1, c = 3, i;

    cycle1 = ClockCycles();
    for (i = 0; i < 100000000; i++)
        a = ((a + b + c) * c) / 2;
    cycle2 = ClockCycles();

    ncycles = cycle2 - cycle1;
    printf("%" PRIu64 " cycles elapsed\n", ncycles);

    cps = SYSPAGE_ENTRY(qtime)->cycles_per_sec;
    printf("This system has %" PRIu64 " cycles per second\n", cps);

    sec = (double)ncycles / cps;
    printf("The cycles in seconds is %f\n", sec);

    return EXIT_SUCCESS;
}
I got the following results:
-> 807679611 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.244600
And now without register int:
int a=0, b = 1, c = 3, i;
I got:
-> 1421694077 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.430700
During the seventies, at the very beginning of the C language, the register keyword was introduced to allow the programmer to give a hint to the compiler, telling it that the variable would be used very often and that it would be wise to keep its value in one of the processor's internal registers.
Nowadays, optimizers are much more efficient than programmers at determining which variables are worth keeping in registers, and the optimizer does not always take the programmer's hint into account.
So many people wrongly recommend not to use the register keyword.
Let’s see why!
The register keyword has an associated side effect: you cannot reference (take the address of) a register variable.
People advising others not to use register wrongly take this as an additional argument.
However, the simple fact that the address of a register variable can never be taken allows the compiler (and its optimizer) to know that the value of this variable cannot be modified indirectly through a pointer.
When, at a certain point of the instruction stream, a register variable's value is held in a processor register, and that register has not since been used to hold the value of another variable, the compiler knows that it does not need to re-load the value of the variable into that register.
This helps avoid expensive, useless memory accesses.
Do your own tests and you will get significant performance improvements in your innermost loops.
Register would notify the compiler that the coder believed this variable would be written/read enough to justify its storage in one of the few registers available for variable use. Reading/writing from registers is usually faster and can require a smaller op-code set.
Nowadays, this isn't very useful, as most compilers' optimizers are better than you at determining whether a register should be used for that variable, and for how long.
gcc 9.3 asm output, without using optimisation flags (everything in this answer refers to standard compilation without optimisation flags):
#include <stdio.h>

int main(void) {
    int i = 3;
    i++;
    printf("%d", i);
    return 0;
}

.LC0:
        .string "%d"
main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 3
        add     DWORD PTR [rbp-4], 1
        mov     eax, DWORD PTR [rbp-4]
        mov     esi, eax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     eax, 0
        leave
        ret
#include <stdio.h>

int main(void) {
    register int i = 3;
    i++;
    printf("%d", i);
    return 0;
}

.LC0:
        .string "%d"
main:
        push    rbp
        mov     rbp, rsp
        push    rbx
        sub     rsp, 8
        mov     ebx, 3
        add     ebx, 1
        mov     esi, ebx
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
This forces ebx to be used for the calculation, meaning it needs to be pushed to the stack and restored at the end of the function because it is callee saved. register produces more lines of code and 1 memory write and 1 memory read (although realistically, this could have been optimised to 0 R/Ws if the calculation had been done in esi, which is what happens using C++'s const register). Not using register causes 2 writes and 1 read (although store to load forwarding will occur on the read). This is because the value has to be present and updated directly on the stack so the correct value can be read by address (pointer). register doesn't have this requirement and cannot be pointed to. const and register are basically the opposite of volatile and using volatile will override the const optimisations at file and block scope and the register optimisations at block-scope. const register and register will produce identical outputs because const does nothing on C at block-scope, so only the register optimisations apply.
On clang, register is ignored but const optimisations still occur.
On supported C compilers it tries to optimize the code so that variable's value is held in an actual processor register.
Microsoft's Visual C++ compiler ignores the register keyword when global register-allocation optimization (the /Oe compiler flag) is enabled.
See register Keyword on MSDN.
The register keyword tells the compiler to store the particular variable in a CPU register so that it can be accessed fast. From a programmer's point of view it is used for the variables that are heavily used in a program, so that the compiler can speed up the code. It is still up to the compiler whether to keep the variable in a CPU register or in main memory.
register indicates to the compiler that it should optimize the code by storing that particular variable in a register rather than in memory. It is a request to the compiler; the compiler may or may not honor it.
You can use this facility in cases where some of your variables are accessed very frequently, for example a loop counter.
One more thing: if you declare a variable as register, you cannot take its address, since it may not be stored in memory at all but allocated in a CPU register.
The register keyword is a request to the compiler that the specified variable is to be stored in a register of the processor instead of memory as a way to gain speed, mostly because it will be heavily used. The compiler may ignore the request.
