"register" keyword in C? - c

What does the register keyword do in C language? I have read that it is used for optimizing but is not clearly defined in any standard. Is it still relevant and if so, when would you use it?

It's a hint to the compiler that the variable will be heavily used and that you recommend it be kept in a processor register if possible.
Most modern compilers do that automatically, and are better at picking them than us humans.

I'm surprised that nobody mentioned that you cannot take an address of register variable, even if compiler decides to keep variable in memory rather than in register.
So using register you win nothing (anyway compiler will decide for itself where to put the variable) and lose the & operator - no reason to use it.

It tells the compiler to try to use a CPU register, instead of RAM, to store the variable. Registers are in the CPU and much faster to access than RAM. But it's only a suggestion to the compiler, and it may not follow through.

I know this question is about C, but the same question for C++ was closed as a exact duplicate of this question. This answer therefore may not apply for C.
The latest draft of the C++11 standard, N3485, says this in 7.1.1/3:
A register specifier is a hint to the implementation that the variable so declared will be heavily used. [ note: The hint can be ignored and in most implementations it will be ignored if the address of the variable is taken. This use is deprecated ... —end note ]
In C++ (but not in C), the standard does not state that you can't take the address of a variable declared register; however, because a variable stored in a CPU register throughout its lifetime does not have a memory location associated with it, attempting to take its address would be invalid, and the compiler will ignore the register keyword to allow taking the address.

I have read that it is used for optimizing but is not clearly defined in any standard.
In fact it is clearly defined by the C standard. Quoting the N1570 draft section 6.7.1 paragraph 6 (other versions have the same wording):
A declaration of an identifier for an object with storage-class
specifier register suggests that access to the object be as fast
as possible. The extent to which such suggestions are effective is
implementation-defined.
The unary & operator may not be applied to an object defined with register, and register may not be used in an external declaration.
There are a few other (fairly obscure) rules that are specific to register-qualified objects:
Defining an array object with register has undefined behavior.
Correction: It's legal to define an array object with register, but you can't do anything useful with such an object (indexing into an array requires taking the address of its initial element).
The _Alignas specifier (new in C11) may not be applied to such an object.
If the parameter name passed to the va_start macro is register-qualified, the behavior is undefined.
There may be a few others; download a draft of the standard and search for "register" if you're interested.
As the name implies, the original meaning of register was to require an object to be stored in a CPU register. But with improvements in optimizing compilers, this has become less useful. Modern versions of the C standard don't refer to CPU registers, because they no longer (need to) assume that there is such a thing (there are architectures that don't use registers). The common wisdom is that applying register to an object declaration is more likely to worsen the generated code, because it interferes with the compiler's own register allocation. There might still be a few cases where it's useful (say, if you really do know how often a variable will be accessed, and your knowledge is better than what a modern optimizing compiler can figure out).
The main tangible effect of register is that it prevents any attempt to take an object's address. This isn't particularly useful as an optimization hint, since it can be applied only to local variables, and an optimizing compiler can see for itself that such an object's address isn't taken.

It hasn't been relevant for at least 15 years as optimizers make better decisions about this than you can. Even when it was relevant, it made a lot more sense on a CPU architecture with a lot of registers, like SPARC or M68000 than it did on Intel with its paucity of registers, most of which are reserved by the compiler for its own purposes.

Actually, register tells the compiler that the variable does not alias with
anything else in the program (not even char's).
That can be exploited by modern compilers in a variety of situations, and can help the compiler quite a bit in complex code - in simple code the compilers can figure this out on their own.
Otherwise, it serves no purpose and is not used for register allocation. It does not usually incur performance degradation to specify it, as long as your compiler is modern enough.

Storytime!
C, as a language, is an abstraction of a computer. It allows you to do things, in terms of what a computer does, that is manipulate memory, do math, print things, etc.
But C is only an abstraction. And ultimately, what it's extracting from you is Assembly language. Assembly is the language that a CPU reads, and if you use it, you do things in terms of the CPU. What does a CPU do? Basically, it reads from memory, does math, and writes to memory. The CPU doesn't just do math on numbers in memory. First, you have to move a number from memory to memory inside the CPU called a register. Once you're done doing whatever you need to do to this number, you can move it back to normal system memory. Why use system memory at all? Registers are limited in number. You only get about a hundred bytes in modern processors, and older popular processors were even more fantastically limited (The 6502 had 3 8-bit registers for your free use). So, your average math operation looks like:
load first number from memory
load second number from memory
add the two
store answer into memory
A lot of that is... not math. Those load and store operations can take up to half your processing time. C, being an abstraction of computers, freed the programmer the worry of using and juggling registers, and since the number and type vary between computers, C places the responsibility of register allocation solely on the compiler. With one exception.
When you declare a variable register, you are telling the compiler "Yo, I intend for this variable to be used a lot and/or be short lived. If I were you, I'd try to keep it in a register." When the C standard says compilers don't have to actually do anything, that's because the C standard doesn't know what computer you're compiling for, and it might be like the 6502 above, where all 3 registers are needed just to operate, and there's no spare register to keep your number. However, when it says you can't take the address, that's because registers don't have addresses. They're the processor's hands. Since the compiler doesn't have to give you an address, and since it can't have an address at all ever, several optimizations are now open to the compiler. It could, say, keep the number in a register always. It doesn't have to worry about where it's stored in computer memory (beyond needing to get it back again). It could even pun it into another variable, give it to another processor, give it a changing location, etc.
tl;dr: Short-lived variables that do lots of math. Don't declare too many at once.

You are messing with the compiler's sophisticated graph-coloring algorithm. This is used for register allocation. Well, mostly. It acts as a hint to the compiler -- that's true. But not ignored in its entirety since you are not allowed to take the address of a register variable (remember the compiler, now on your mercy, will try to act differently). Which in a way is telling you not to use it.
The keyword was used long, long back. When there were only so few registers that could count them all using your index finger.
But, as I said, deprecated doesn't mean you cannot use it.

Just a little demo (without any real-world purpose) for comparison: when removing the register keywords before each variable, this piece of code takes 3.41 seconds on my i7 (GCC), with register the same code completes in 0.7 seconds.
#include <stdio.h>
int main(int argc, char** argv) {
register int numIterations = 20000;
register int i=0;
unsigned long val=0;
for (i; i<numIterations+1; i++)
{
register int j=0;
for (j;j<i;j++)
{
val=j+i;
}
}
printf("%d", val);
return 0;
}

I have tested the register keyword under QNX 6.5.0 using the following code:
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <sys/neutrino.h>
#include <sys/syspage.h>
int main(int argc, char *argv[]) {
uint64_t cps, cycle1, cycle2, ncycles;
double sec;
register int a=0, b = 1, c = 3, i;
cycle1 = ClockCycles();
for(i = 0; i < 100000000; i++)
a = ((a + b + c) * c) / 2;
cycle2 = ClockCycles();
ncycles = cycle2 - cycle1;
printf("%lld cycles elapsed\n", ncycles);
cps = SYSPAGE_ENTRY(qtime) -> cycles_per_sec;
printf("This system has %lld cycles per second\n", cps);
sec = (double)ncycles/cps;
printf("The cycles in seconds is %f\n", sec);
return EXIT_SUCCESS;
}
I got the following results:
-> 807679611 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.244600
And now without register int:
int a=0, b = 1, c = 3, i;
I got:
-> 1421694077 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.430700

During the seventies, at the very beginning of the C language, the register keyword has been introduced in order to allow the programmer to give hints to the compiler, telling it that the variable would be used very often, and that it should be wise to keep it’s value in one of the processor’s internal register.
Nowadays, optimizers are much more efficient than programmers to determine variables that are more likely to be kept into registers, and the optimizer does not always take the programmer’s hint into account.
So many people wrongly recommend not to use the register keyword.
Let’s see why!
The register keyword has an associated side effect: you can not reference (get the address of) a register type variable.
People advising others not to use registers takes wrongly this as an additional argument.
However, the simple fact of knowing that you can not take the address of a register variable, allows the compiler (and its optimizer) to know that the value of this variable can not be modified indirectly through a pointer.
When at a certain point of the instruction stream, a register variable has its value assigned in a processor’s register, and the register has not been used since to get the value of another variable, the compiler knows that it does not need to re-load the value of the variable in that register.
This allows to avoid expensive useless memory access.
Do your own tests and you will get significant performance improvements in your most inner loops.
c_register_side_effect_performance_boost

Register would notify the compiler that the coder believed this variable would be written/read enough to justify its storage in one of the few registers available for variable use. Reading/writing from registers is usually faster and can require a smaller op-code set.
Nowadays, this isn't very useful, as most compilers' optimizers are better than you at determining whether a register should be used for that variable, and for how long.

gcc 9.3 asm output, without using optimisation flags (everything in this answer refers to standard compilation without optimisation flags):
#include <stdio.h>
int main(void) {
int i = 3;
i++;
printf("%d", i);
return 0;
}
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
add DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
#include <stdio.h>
int main(void) {
register int i = 3;
i++;
printf("%d", i);
return 0;
}
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
push rbx
sub rsp, 8
mov ebx, 3
add ebx, 1
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
add rsp, 8
pop rbx
pop rbp
ret
This forces ebx to be used for the calculation, meaning it needs to be pushed to the stack and restored at the end of the function because it is callee saved. register produces more lines of code and 1 memory write and 1 memory read (although realistically, this could have been optimised to 0 R/Ws if the calculation had been done in esi, which is what happens using C++'s const register). Not using register causes 2 writes and 1 read (although store to load forwarding will occur on the read). This is because the value has to be present and updated directly on the stack so the correct value can be read by address (pointer). register doesn't have this requirement and cannot be pointed to. const and register are basically the opposite of volatile and using volatile will override the const optimisations at file and block scope and the register optimisations at block-scope. const register and register will produce identical outputs because const does nothing on C at block-scope, so only the register optimisations apply.
On clang, register is ignored but const optimisations still occur.

On supported C compilers it tries to optimize the code so that variable's value is held in an actual processor register.

Microsoft's Visual C++ compiler ignores the register keyword when global register-allocation optimization (the /Oe compiler flag) is enabled.
See register Keyword on MSDN.

Register keyword tells compiler to store the particular variable in CPU registers so that it could be accessible fast. From a programmer's point of view register keyword is used for the variables which are heavily used in a program, so that compiler can speedup the code. Although it depends on the compiler whether to keep the variable in CPU registers or main memory.

Register indicates to compiler to optimize this code by storing that particular variable in registers then in memory. it is a request to compiler, compiler may or may not consider this request.
You can use this facility in case where some of your variable are being accessed very frequently.
For ex: A looping.
One more thing is that if you declare a variable as register then you can't get its address as it is not stored in memory. it gets its allocation in CPU register.

The register keyword is a request to the compiler that the specified variable is to be stored in a register of the processor instead of memory as a way to gain speed, mostly because it will be heavily used. The compiler may ignore the request.

Related

How does Google's `DoNotOptimize()` function enforce statement ordering

I'm trying to understand exactly how Google's DoNotOptimize() is supposed to work.
For completeness, here is its definition (for clang, and non-const data):
template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) {
asm volatile("" : "+r,m"(value) : : "memory");
}
As I understand we can use this in code like this:
start_time = time();
bench_output = run_bench(bench_inputs);
result = time() - start_time;
To ensure that the benchmark stays in the critical section:
start_time = time();
DoNotOptimize(bench_inputs);
bench_output = run_bench(bench_inputs);
DoNotOptimise(bench_output);
result = time() - start_time;
Specifically what I don't understand is why this guarantees (does it?) that run_bench() is not moved above start_time = time().
(Someone asked exactly this in this comment, however I don't understand the answer).
As I understand, the above DoNotOptimze() does several things:
It forces value to the stack, as it is passed by C++ reference. You can't have a pointer to a register, so it must be in memory.
Because value is now on the stack, subsequently clobbering memory (as done in the asm constraints) will force the compiler to assume that value is both read and written by the call to DoNotOptimize(value).
(it's not clear to me if the +r,m constraint is relevant. As far as I know this says that the pointer itself may be stored in a register or in memory, but the pointer value itself may be read and/or written.)
And this is where things get fuzzy for me.
If start_time is also stack allocated, the memory clobbering in DoNotOptimize() will mean that the compiler must assume that DoNotOptimize() potentially reads start_time. Therefore the order of the statements can only be:
start_time = time(); // on the stack
DoNotOptimize(bench_inputs); // reads start_time, writes bench_inputs
bench_output = run_bench(bench_inputs)
But if start_time is not stored in memory, but instead in a register, then clobbering memory will not clobber start_time, right? In that case the desired ordering of start_time = time() and DoNotOptimize(bench_inputs) is lost and the compiler is free to do:
DoNotOptimize(bench_inputs); // only writes bench_inputs
bench_output = run_bench(bench_inputs)
start_time = time(); // in a register
Obviously I've misunderstood something. Can anyone help explain? Thanks :)
I'm wondering if this is because reordering optimisations happen prior to register allocation, and thus everything is assumed to be stack allocated at that time. But if that were the case, then DoNotOptimize() would be redundant, as ClobberMemory() would be sufficient.
Summary: DoNotOptimize is ordered wrt. time() by the the "memory" clobbers, as if it were another function call to an opaque function that could modify any global state.
DoNotOptimize is ordered wrt. the computation of output from input by the data dependency of the calculation on the input, and the output on the calculation, as Chandler Carruth explained in the Q&A you linked. The "memory" clobber is irrelevant for this part.
"memory" clobber is like a non-inline function call
DoNotOptimize's asm statement contains a "memory" clobber. As far as the optimizer is concerned, that's equivalent to an opaque function call: it has to be assumed to read and write every globally-reachable object1. (Even ones this compilation unit might not know about.)
Since time() itself doesn't have an inline definition in any header, it can't reorder with DoNotOptimize at compile time for the same reason that a compiler can't reorder calls to foo() and bar() when it can't see the definitions of those functions. Same reason compilers don't need any special logic to stop them from reordering puts("hi"); puts("mom");.
(A hypothetical time() that could inline and only contained an asm statement would have to use asm volatile to make sure repeated calls didn't just use the first one's output. asm volatile statements can't reorder with each other or accesses to volatile variables, so that would be ok too, for a different reason.)
Footnote 1: Globally reachable = any object that might be pointed-to by any hypothetical global variable. i.e. anything except local variables within this function, or memory freshly allocated with new, if escape analysis can prove that nothing outside this function could have pointers to them.
How the asm statement works
I think you're pretty seriously misunderstanding how the asm works. "+r,m" tells the compiler to materialize the value in a register (or memory if it wants), and then use the value there at the end of the (empty) asm template as the new value of that C++ object.
So it forces the compiler to actually materialize (produce) the value somewhere, which means it has to be computed. And it means has to forget what it previously knew about the value (e.g. that it was a compile time constant 5, or non-negative, or anything) because the "+" modifier declares a read/write operand.
The point of DoNotOptimize on the input is to defeat constant-propagation that would let the benchmark optimize away.
And on the output to make sure a final result is actually materialized in a register (or memory) instead of optimizing away all the computation leading to an unused result. (This is where being asm volatile is relevant; defeating constant-propagation still works with non-volatile asm.)
So the computation you want to benchmark has to happen between the two DoNotOptimize() statements, and separately those two statements can't reorder with time().
The compiler has to assume that the asm statement modifies the value like val ^= random for all it knows, along with changing the value in memory of any/every other object except for private locals that weren't operands, so e.g. the "memory" clobber doesn't stop the compiler from keeping a local loop counter in memory. (It doesn't special case an empty asm template string here; programs don't contain asm statements like this by accident so nobody wants them optimized away.)
Misconceptions about the reference arg and picking "m"
I only got part way into the details of your attempt to reason about the "+r,m" operand and the reference function-arg before deciding it would probably be better to just explain from scratch. The correct reason isn't that complicated. But a couple things are worth specifically correcting:
The C++ function containing the asm statement can inline, letting the by-reference function arg optimize away. (It's even declared inline __attribute__((always_inline)) to force inlining even with optimization disabled, although in that case the reference variable won't optimize away.)
The net result is as if the asm statement were used directly on the C++ variable passed to DoNotOptimize. e.g. DoNotOptimize(foo) is like asm volatile("" : "+r,m"(foo) :: "memory")
The compiler can always pick register if it wants to, e.g. choosing to load a variable's value into a register before an asm statement. (And if the C++ semantics demand updating the variable's value in memory, also emitting a store instruction after the asm statement.)
For example, we can see that GCC does choose to do that. (I guess I could have used incl %0 as the example, but I just chose nop as a way to show what the compiler picked for the operand location as an alternative to # %0 pure comment, so the Godbolt compiler explorer wouldn't filter it out.)
void foo(int *p)
{
asm volatile("nop # operand picked %0" : "+r,m" (p[4]) );
}
# GCC 11.2 -O2
foo(int*):
movl 16(%rdi), %eax
nop # operand picked %eax
movl %eax, 16(%rdi)
ret
vs. clang choosing to leave the value in memory, so every instruction in the asm template would be accessing memory instead of a register. (If there were any instructions).
# clang 12.0.1 -O2 -fPIE
foo(int*): # #foo(int*)
nop # operand picked 16(%rdi)
retq
Fun fact: "r,m" is an attempt to work around a clang missed-optimization bug that makes it always pick memory for "rm" constraints, even if the value was already in a register. Spilling it first, even if it has to invent a temporary location for the value of an expression as an input.

When initializing variables in C does the compiler store the intial value in the executable

When Initializing variables in C I was wondering does the compiler make it so that when the code is loaded at run time the value is already set OR does it have to make an explicit assembly call to set it to an initial value?
The first one would be slightly more efficient because you're not making a second CPU call just to set the value.
e.g.
void foo() {
int c = 1234;
}
A compiler is not required to do either of them. As long as the behavior of the program stays the same it can pretty much do whatever it wants.
Especially when using optimization, crazy stuff can happen. Looking at the assembly code after heavy optimization can be confusing to say the least.
In your example, both the constant 1234 and the variable c would be optimized away since they are not used.
If it's a variable with static lifetime, it'll typically become part of the executable's static image, which'll get memcpy'ed, along with other statically known data, into the process's allocated memory when the process is started/loaded.
void take_ptr(int*);
void static_lifetime_var(void)
{
static int c = 1234;
take_ptr(&c);
}
x86-64 assembly from gcc -Os:
static_lifetime_var:
mov edi, OFFSET FLAT:c.1910
jmp take_ptr
c.1910:
.long 1234
If it's unused, it'll typically vanish:
void unused(void)
{
int c = 1234;
}
x86-64 assembly from gcc -Os:
unused:
ret
If it is used, it may not be necessary to put it into the function's frame (its local variables on the stack)—it might be possible to directly embed it into an assembly instruction, or "use it as an immediate":
void take_int(int);
void used_as_an_immediate(int d)
{
int c = 1234;
take_int(c*d);
}
x86-64 assembly from gcc -Os:
used_as_an_immediate:
imul edi, edi, 1234
jmp take_int
If it is used as a true local, it'll need to be loaded into stack-allocated space:
void take_ptr(int*);
void used(int d)
{
int c = 1234;
take_ptr(&c);
}
x86-64 assembly from gcc -Os:
used:
sub rsp, 24
lea rdi, [rsp+12]
mov DWORD PTR [rsp+12], 1234
call take_ptr
add rsp, 24
ret
When pondering these things Compiler Explorer along with some basic knowledge of assembly are your friends.
TL;DR: Your examples declares and initializes an automatic variable. It has to be initialized each time the function is called. So there will be some instruction to do this.
As an adjusted duplicate of my answer to How compile time initialization of variables works internally in c?:
The standard defines no exact way of initialization. It depends on the environment your code is developed and run on.
How variables are initialized depends also on their storage duration. You didn't mention it in the text, your example is an automatic variable. (Which is most probably optimized away, as commenters point out.)
Initialized automatic variables will be written each time their declaration is reached. The compiled program executes some machine code for this to happen.
Static variables are always intialized and only once before the program startup.
Examples from the real world:
Most (if not all) PC systems store the initial values of explicitly (and not zero-) initialized static variables in a special section called data that is loaded by the system's loader to RAM. That way those variables get their values before the program startup. Static variables not explicitly initialized or with zero-like values are placed in a section bss and are filled with zeroes by the startup code before program startup.
Many embedded systems have their program in non-volatile memory that can't be changed. On such systems the startup code copies the initial values of the section data into its allocated space in RAM, producing a similar result. The same startup code zeroes also the section bss.
Note 1: The sections don't have to be named liked this. But it is common.
This startup code might be part of the compiled program, or might not. It depends, see above. But speaking of efficience it doesn't matter which program initializes variables. It just has to be done.
Note 2: There are more kinds of storage duration, please see chapter 6.2.4 of the standard.
As long as the standard is met, a system is free to implement any other kind of initialization, including writing the initial values into their variables step by step.
Firstly, its important to have a common understanding of the word 'compiler', else we can end-up arguing endlessly.
In simple words,
a compiler is a computer program that translates computer code written
in one programming language (the source language) into another
programming language (the target language). The name compiler is
primarily used for programs that translate source code from a
high-level programming language to a lower level language (e.g.,
assembly language, object code, or machine code) to create an
executable program.
(Ref: Wikipedia)
With this common understanding, let's now find answer you question: The answer is 'yes, the final code contains explicit assembly call to set it to an initial value' for any kind of variables. It is so because finally the variables are either stored in some memory location, or they live in some CPU register in case the number of variables are so less that the variable can be accommodated into some CPU registers such as your code snippet running on lets say most modern servers (Side note: different systems have different number of registers).
For the variables that are stored in registers, there has to be a mov (or equivalent) kind of instruction to load that initial value into the register. Without such an instruction the assigned register cannot be assigned with the intended value.
For the variables that are stored in memory, depending on the architecture and the compiler efficiency, the init value has to be somehow pushed to the given/assigned address, which takes at least a couple of asm instructions.
Does this answer your question?

How do make sure if a variable defined with "register" specifier got stored in CPU register?

I want to know, How do we make sure if a variable defined with register specifier got stored in CPU register?
Basically, you cannot. There is absolutely nothing in the C standard that gives you the control.
Using the register keyword is giving the compiler a hint that that the variable maybe stored into a register (i.e., allowed fastest possible access). Compiler is free to ignore it. Each compiler can have a different way of accepting/rejecting the hint.
Quoting C11, chapter §6.7.1, (emphasis mine)
A declaration of an identifier for an object with storage-class specifier register
suggests that access to the object be as fast as possible. The extent to which such
suggestions are effective is implementation-defined.
FWIW, most modern-day compilers can detect the mostly-used variables and allocate them in actual register, if required. Remember, CPU register is a scarce resource.
Disassemble the code and check. It may not really be clear at that point, because variables don't really exist, they're just names that link producers with consumers. So, there is not necessarily a register reserved for that variable - maybe it disappeared entirely, maybe it lives in several registers over its lifetime, maybe none of the above.
Historically, the register keyword was introduced decades ago as an optimization hint to the compiler. Nowadays, when the processors have more general purpose registers, the compiler usually places variables in registers even without being told so (when the code is compiled with optimizations).
Being just a hint and not an enforcement, you cannot do anything to force it. You can, however, write that part of the code in assembler. This way you have complete control of where your variables are stored.
If the variable is stored in a register means it is not stored in memory.
So, the bull's eye is try to access the address of the variable using printf. If the output gives some address, conclusion is it is stored in memory so it would act as auto storage class variable (and it is not stored in register).
But if it gives error "incompatible implicit declaration of built-in function 'printf' "..this means the variable is stored in register and would behave as register storage class variable..
Maybe calling Assembly instructions will help with it:
/// Function must be something like this:
int check_register_storing()
{
__asm__ (
pushad // Save registers
and ebx, ebx // Set Zero
and eax, eax
and ecx, ecx
and edx, edx
);
// Set test number.
register int a = 8; // Initial value;
int from_register = 0;
asm(
add eax, ebx // If, 'a' variable set on CPU register,
add eax, ecx // Some of main usage registers must contain 8
add eax, edx // Others must contain 0
mov %from_register, eax
popad // Return default parameters to registers
}
/// Check result
printf( "Original saved number: %d, Returned number from main registers: %d\n", a, from_register );
}
I don't know if I am wrong or right, but we know that a normal variable is stored in memory which has address, but we know that if we write register int a; then a register may be allocated for the variable, but we know that registers have names, not address, so we can't assign pointer to point to a registers because pointers stores address only, so if we write as follow:-
#include<stdio.h>
int main()
{
register int reg = 5;
int *p = ®
printf("%d",reg);
}
then it should give error as: "address of register variable ‘reg’ requested" if the register has successfully allocated to our variable, and if register is not allocated then memory addres can be assigned to pointer hence no error should be there.
IMPORTANT:-This is my first answer on stackoverflow, I may be wrong, please correct me if i am, I'm still learning.

Is accessing statically or dynamically allocated memory faster?

There are 2 ways of allocating global array in C:
statically
char data[65536];
dynamically
char *data;
…
data = (char*)malloc(65536); /* or whatever size */
The question is, which method has better performance? And by how much?
As understand it, the first method should be faster.
Because with the second method, to access the array you have to dereference element's address each time it is accessed, like this:
read the variable data which contains the pointer to the beginning of the array
calculate the offset to specific element
access the element
With the first method, the compiler hard-codes the address of the data variable into the code, first step is skipped, so we have:
calculate the offset to specific element from fixed address defined at compile time
access the element of the array
Each memory access is equivalent to about 40 CPU clock cycles, so , using dynamic allocation, specially for not frequent reads can have significant performance decrease vs static allocation because the data variable may be purged from the cache by some more frequently accessed variable. On the contrary , the cost of dereferencing statically allocated global variable is 0, because its address is already hard-coded in the code.
Is this correct?
One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. Herein, consider char data_s[65536] and char *data_p = malloc(65536)
If the access is sporadic the static/global will be slightly faster:
// slower because we must fetch data_p and then store
void
datasetp(int idx,char val)
{
data_p[idx] = val;
}
// faster because we can store directly
void
datasets(int idx,char val)
{
data_s[idx] = val;
}
Now, if we consider caching, datasetp and datasets will be about the same [after the first access], because the fetch of data_p will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.
However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p at the start of the loop and put it in a register:
void
datasetalls(char val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data_s[idx] = val;
}
void
datasetallp(char val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data_p[idx] = val;
}
void
datasetallp_optimized(char val)
{
int idx;
register char *reg;
// the optimizer will generate the equivalent code to this
reg = data_p;
for (idx = 0; idx < 65536; ++idx)
reg[idx] = val;
}
If the access is so sporadic that data_p gets evicted from the cache, then, the performance difference doesn't matter so much, because access to [either] array is infrequent. Thus, not a target for code tuning.
If such eviction occurs, the actual data array will, most likely, be evicted as well.
A much larger array might have more of an effect (e.g. if instead of 65536, we had 100000000, then mere traversal will evict data_p and by the time we reached the end of the array, the leftmost entries would already be evicted.
But, in that case, the extra fetch of data_p would be 0.000001% overhead.
So, it helps to either benchmark [or model] the particular use case/access pattern.
UPDATE:
Based on some further experimentation [triggered by a comment by Peter], the datasetallp function does not optimize to the equivalent of datasetallp_optimized for certain conditions, due to strict aliasing considerations.
Because the arrays are char [or unsigned char], the compiler generates a data_p fetch on each loop iteration. Note that if the arrays are not char (e.g. int), the optimization does occur and data_p is fetched only once, because char can alias anything but int is more limited.
If we change char *data_p to char *restrict data_p we get the optimized behavior. Adding restrict tells the compiler that data_p will not alias anything [even itself], so it's safe to optimize the fetch.
Personal note: While I understand this, to me, it seems goofy that without restrict, the compiler must assume that data_p can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p pointing to itself or that data_p is part of a struct:
// simplest
char *data_p = malloc(65536);
data_p = (void *) &data_p;
// closer to real world
struct mystruct {
...
char *data_p;
...
};
struct mystruct mystruct;
mystruct.data_p = (void *) &mystruct;
These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should allow fetch optimization].
For myself, I always hand code the equivalent of datasetallp_optimized [sans register], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p only once".
Also, the multifetch problem does not occur when using data_p for input [because we're not modifying anything, aliasing is not a consideration]:
// does fetch of data_p once at loop start
int
datasumallp(void)
{
int idx;
int sum;
sum = 0;
for (idx = 0; idx < 65536; ++idx)
sum += data_p[idx];
return sum;
}
But, while it can be fairly common, "hardwiring" a primitive array manipulation function with an explicit array [either data_s or data_p] is often less useful than passing the array address as an argument.
Side note: clang would optimize some of the functions using data_s into memset calls, so, during experimentation, I modified the example code slightly to prevent this.
void
dataincallx(array_t *data,int val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data[idx] = val + idx;
}
This does not suffer from the multifetch problem. That is, dataincallx(data_s,17) and dataincallx(data_p,37) work about the same [with the initial extra data_p fetch]. This is more likely to be what one might use in general [better code reuse, etc].
So, the distinction between data_s and data_p becomes a bit more of a moot point. Coupled with judicious use of restrict [or using types other than char], the data_p fetch overhead can be minimized to the point where it isn't really that much of a consideration.
It now comes down more to architectural/design choices of choosing a fixed size array or dynamically allocating one. Others have pointed out the tradeoffs.
This is use case dependent.
If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.
However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better that each function references the global [either data_s or data_p] directly.
Calculating offsets is not a big performance issue. You have to consider how you will actually use the array in your code. You'll most likely write something like data[i] = x; and then no matter where data is stored, the program has to load a base address and calculate an offset.
The scenario where the compiler can hard code the address in case of the statically allocated array only happens when you write something like data[55] = x; which is probably a far less likely use case.
At any rate we are talking about a few CPU ticks here and there. It's not something you should go chasing by attempting manual optimization.
Each memory access is equivalent to about 40 CPU clock cycles
What!? What CPU is that? Some pre-ancient computer from 1960?
Regarding cache memory, those concerns may be more valid. It is possible that statically allocated memory utilizes data cache better, but that's just speculation and you'd have to have a very specific CPU in mind to have that discussion.
There is however a significant performance difference between static and dynamic allocation, and that is the allocation itself. For each call to malloc there is a call to the OS API, which in turn runs search function going through the heap and looking for for a free segment. The library also needs to keep track of the address to that segment internally, so that when you call free() it knows how much memory to release. Also, the more you call malloc/free, the more segmented the heap will become.
I think that data locality is much more of an issue than computing the base address of the array. (I could imagine cases where accessing the pointer contents is extremely fast because it is in a register while the offset to the stack pointer or text segment is a compile time constant; accessing a register may be faster.)
But the real issue will be data locality, which is often a reason to be careful with dynamic memory in performance critical tight loops. If you have more dynamically allocated data which happens to be close to your array, chances are the memory will remain in the cache. If you have data scattered all over the RAM allocated at different times, you may have many cache misses accessing them. In that case it would be better to allocate them statically (or on the stack) next to each other, if possible.
There is a small effect here. It's unlikely to be significant, but it is real. It will often take one extra instruction to resolve the extra level of indirection for a global pointer-to-a-buffer instead of a global array. For most uses, other considerations will be more important (like reuse of the same scratch space, vs giving each function its own scratch buffer). Also: avoiding compile-time size limits!
This effect is only present when you reference the global directly, rather than passing around the address as a function parameter. Inlining / whole-program link-time optimization may see all the way back to where the global is used as a function arg initially, and be able to take advantage of it throughout more code, though.
Let's compare simple test functions, compiled by clang 3.7 for x86-64 (SystemV ABI, so the first arg is in rdi). Results on other architectures will be essentially the same:
int data_s[65536];
int *data_p;
void store_s(int val) { data_s[val] = val; }
movsxd rax, edi ; sign-extend
mov dword ptr [4*rax + data_s], eax
ret
void store_p(int val) { data_p[val] = val; }
movsxd rax, edi
mov rcx, qword ptr [rip + data_p] ; the extra level of indirection
mov dword ptr [rcx + 4*rax], eax
ret
Ok, so there's overhead of one extra load. (mov r64, [rel data_p]). Depending on what other static/global objects data_p is stored near, it may tend to stay hot in cache even if we're not using it often. If it's in a cache line with no other frequently-accessed data, it's wasting most of that cache line, though.
The overhead is only paid once per function call, though, even if there's a loop. (Unless the array is an array of pointers, since C aliasing rules require the compiler to assume that any pointer might be pointing to data_p, unless it can prove otherwise. This is the main performance danger when using global pointers-to-pointers.)
If you don't use restrict, the compiler still has to assume that the buffers pointed to by int *data_p1 and int *data_p2 could overlap, though, which interferes with autovectorization, loop unrolling, and many other optimizations. Static buffers can't overlap with other static buffers, but restrict is still needed when using a static buffer and a pointer in the same loop.
Anyway, let's have a look at the code for very simple memset-style loops:
void loop_s(int val) { for (int i=0; i<65536; ++i) data_s[i] = val; }
mov rax, -262144 ; loop counter, counting up towards zero
.LBB3_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rax + data_s+262144], edi
add rax, 4
jne .LBB3_1
ret
Note that clang uses a non-RIP-relative effective address for data_s here, because it can.
void loop_p(int val) { for (int i=0; i<65536; ++i) data_p[i] = val; }
mov rax, qword ptr [rip + data_p]
xor ecx, ecx
.LBB4_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rax + 4*rcx], edi
add rcx, 1
cmp rcx, 65536
jne .LBB4_1
ret
Note the mov rax, qword [rip + data_p] in loop_p, and the less efficient loop structure because it uses the loop counter as an array index.
gcc 5.3 has much less difference between the two loops: it gets the start address into a register and increments it, with a compare against the end address. So it uses a one-register addressing mode for the store, which may perform better on Intel CPUs. The only difference in loop structure / overhead for gcc is that the static buffer version gets the initial pointer into a register with a mov r32, imm32, rather than a load from memory. (So the address is an immediate constant embedded in the instruction stream.)
In shared-library code, and on OS X, where all executables must be position-independent, gcc's way is the only option. Instead of mov r32, imm32 to get the address into a register, it would use lea r64, [RIP + displacement]. The opportunity to save an instruction by embedding the absolute address into other instructions is gone when you need to offset the address by a variable amount (e.g. array index). If the array index is a compile-time constant, it can be folded into the displacement for a RIP-relative load or store instead of LEA. For a loop over an array, this could only happen with full unrolling, and is thus unlikely.
Still, the extra level of indirection is still there with a pointer to dynamically allocated memory. There's still a chance of a cache or TLB miss when doing a load instead of an LEA.
Note that global data (as opposed to static) has an extra level of indirection through the global offset table, but that's on top of the indirection or lack thereof from dynamic allocation. compiling with gcc 5.3 -fPIC.
int global_data_s[65536];
int access_global_s(int i){return global_data_s[i];}
mov rax, QWORD PTR global_data_s#GOTPCREL[rip] ; load from a RIP-relative address, instead of an LEA
movsx rdi, edi
mov eax, DWORD PTR [rax+rdi*4] ; load, indexing the array
ret
int *global_data_p;
int access_global_p(int i){return global_data_p[i];}
mov rax, QWORD PTR global_data_p#GOTPCREL[rip] ; extra layer of indirection through the GOT
movsx rdi, edi
mov rax, QWORD PTR [rax] ; load the pointer (the usual layer of indirection)
mov eax, DWORD PTR [rax+rdi*4] ; load, indexing the array
ret
If I understand this correctly, the compiler doesn't assume that the symbol definition for global symbols in the current compilation unit are the definitions that will actually be used at link time. So the RIP-relative offset isn't a compile-time constant. Thanks to runtime dynamic linking, it's not a link-time constant either, so an extra level of indirection through the GOT is used. This is unfortunate, and I hope compilers on OS X don't have this much overhead for global variables. With -O0 -fwhole-program on godbolt, I can see that even the globals are accessed with just RIP-relative addressing, not through the GOT, so perhaps that option is even more valuable than usual when making position-independent executables. Hopefully link-time-optimization works too, because that could be used when making shared libraries.
As many other answers have pointed out, there are other important factors, like memory locality, and the overhead of actually doing the allocate/free. These don't matter much for a large buffer (multiple pages) that's allocated once at program startup. See the comments on Peter A. Schneider's answer.
Dynamic allocation can give significant benefits, though, if you end up using the same memory as scratch space for multiple different things, so it stays hot in cache. Giving each function its own static buffer for scratch space is often a bad move if they aren't needed simultaneously: the dirty memory has to get written back to main memory when it's no longer needed, and is part of the program's footprint forever.
A good way to get small scratch buffers without the overhead of malloc (or new) is to create them on the stack (e.g. as local array variables). C99 allows variable-sized local arrays, like foo(int n) { int buf[n]; ...; } Be careful not to overdo it and run out of stack space, but the current stack page is going to be hot in the TLB. The _local functions in my godbolt links allocate a variable-sized array on the stack, which has some overhead for re-aligning the stack to a 16B boundary after adding a variable size. It looks like clang takes care to mask off the sign bit, but gcc's output looks like it will just break in fun and exciting ways if n is negative. (In godbolt, use the "binary" button to get disassembler output, instead of the compiler's asm output, because the disassembly uses hex for immediate constants. e.g. clang's movabs rcx, 34359738352 is 0x7fffffff0). Even though it takes a few instructions, it's much cheaper than malloc. A medium to large allocation with malloc, like 64kiB, will typically make an mmap system call. But this is the cost of allocation, not the cost of accessing once allocated.
Having the buffer on the stack means the stack pointer itself is the base address for indexing into it. This means it doesn't take an extra register to hold that pointer, and it doesn't have to be loaded from anywhere.
If any elements are statically initialized to non-zero in a static (or global), the entire array or struct will be sitting there in the executable, which is a big waste of space if most entries should be zero at program startup. (Or if the data can be computed on the fly quickly.)
On some systems, having a gigantic array zero-initialized static array doesn't cost you anything as long as you never even read the parts you don't need. Lazy mapping of memory means the OS maps all the pages of your giant array to the same page of zeroed memory, and does copy-on-write. Taking advantage of this would be an ugly performance hack to be used only if you were sure you really wanted it, and were sure your code would never run on a system where your char data[1<<30] would actually use that much memory right away.
Each memory access is equivalent to about 40 CPU clock cycles.
This is nonsense. The latency can be anywhere from 3 or 4 cycles (L1 cache hit) to multiple hundreds of cycles (main memory), or even a page fault requiring a disk access. Other than a page fault, much of this latency can overlap with other work, so the impact on throughput can be much lower. A load from a constant address can start as soon as the instruction issues into the out-of-order core, since it's the start of a new dependency chain. The size of the out-of-order window is limited (an Intel Skylake core has a Re-Order Buffer of 224 uops, and can have 72 loads in flight at once). A full cache miss (or worse, a TLB miss followed by a cache miss) often does stall out-of-order execution. See http://agner.org/optimize/, and other links in the x86 wiki. Also see this blog post about the impact of ROB size on how many cache misses can be in flight at once.
Out-of-order cores for other architectures (like ARM and PPC) are similar, but in-order cores suffer more from cache misses because they can't do anything else while waiting. (Big x86 cores like Intel and AMD's mainstream microarchitectures (not the low-power Silvermont or Jaguar microarchitectures) have more out-of-order execution resources than other designs, though. AFAIK, most ARM cores have significantly smaller buffers for starting independent loads early and/or hiding cache-miss latency.)
I would say you really should profile it. Theoretically you are right but there are some basic things you have to remember.
Language C is a high-level language like many there exist today and you tell the machine what to do. Getting closer to machine code would be considering ASM or similar. If you build code, through compiling and linking or whatever, the compiler will try the best to correctly run what you demand and optimize it (unless you don't want that). Remember, there also exist concepts like Just-In-Time compilation (JIT).
So I consider it hard to answer your question. For one thing you can be sure. A static array will most likely be faster especially with the size of 65536 because there are more chances of optimization for the compiler. This might depend on what size you defined. For GCC 65536 bytes seems to be common for stacks and caches, not sure. Some compilers might even tell you the array is too big, because they try to keep it in other memory hierarchies like caches which also are faster than Random Access Memory.
Last but not least remember that modern operating systems also have their memory management using virtual memory.
Static memory can be stored in data segments and will most likely be loaded when the program is executed, but remember this is also time you have to consider. Allocate the memory by the OS when the program is started or do it at runtime? It really depends on your application.
So I think you really should benchmark your results and see by how much faster it is. But as tendency I would say your static array will compile a code that is going to run faster.

who decide actual storage of register storage class?

Recently I have been asked following Q in a interview :
Who actually decide that where register variable will be store(in RAM or register).
I have searched on google, the ans I got that compiler decide.
but how can compiler decide ? It should be decided at run time as per my understanding.
what if we compile and run program in different machine , then how can compiler decide where to store register storage class value.
The register storage class specifier is a hint to the compiler that accesses to a variable ought to be “as fast as possible”, implying (on register architectures) that its storage should be allocated to a register. This prohibits a few things, like taking the address of the variable—registers don’t have addresses. However, the compiler is free to ignore this hint (§6.7.1¶5):
A declaration of an identifier for an object with storage-class specifier register suggests that access to the object be as fast as possible. The extent to which such suggestions are effective is implementation-defined.
When compiling your code, the compiler must choose how to map local variables and arithmetic operations into operations on CPU registers and stack memory. This is called register allocation; these decisions are made at compile time, and baked into the compiled code of a function.
Suppose we have a very naïve compiler that does precisely what we say, and doesn’t optimise anything. If we give it this code:
int x;
x = 2;
x += 5;
Then we might expect to see this output, on an x86 machine:
sub esp, 4 ; allocate stack space for an integer
mov dword [esp], 2
add dword [esp], 5
But if we write:
register int x;
x = 2;
x += 5;
Then we might expect to see:
mov eax, 2
add eax, 5
The latter is more efficient because it prefers register accesses over memory accesses. In practice, contemporary compilers have intelligent register allocation algorithms that make this storage class specifier unnecessary.
Several types of optimisations are made by compiler,during compile time and depending upon these optimisations, the request is granted or refused.
The third last phase of compilation --- intermediate code generation keeps the basis of generating an intermediate three-address(opcode) based
coding,which is further optimised in the second last phase of
compiler-optimisation. The last phase of compiler --- target
code generation makes it guaranteed whether the register storage
class variable will be granted the register or not.
The request of granting register access to the variable is made by the program,but,finally, it is the compiler who decides the allocation of variable's memory in the register depending on :-
Availability of registers in the CPU.
More stable optimisations,etc.

Resources