Is accessing statically or dynamically allocated memory faster? - c

There are two ways of allocating a global array in C:
statically
char data[65536];
dynamically
char *data;
…
data = (char*)malloc(65536); /* or whatever size */
The question is, which method has better performance? And by how much?
As I understand it, the first method should be faster.
Because with the second method, to access the array you have to dereference the pointer each time an element is accessed, like this:
read the variable data, which contains the pointer to the beginning of the array
calculate the offset to the specific element
access the element
With the first method, the compiler hard-codes the address of the data variable into the code, so the first step is skipped and we have:
calculate the offset to the specific element from the fixed address defined at compile time
access the element of the array
Each memory access is equivalent to about 40 CPU clock cycles, so, with dynamic allocation, especially for infrequent reads, there can be a significant performance decrease compared to static allocation, because the data variable may be purged from the cache by some more frequently accessed variable. In contrast, the cost of dereferencing a statically allocated global variable is zero, because its address is already hard-coded in the code.
Is this correct?

One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. In what follows, consider char data_s[65536] and char *data_p = malloc(65536).
If the access is sporadic the static/global will be slightly faster:
// slower because we must fetch data_p and then store
void
datasetp(int idx, char val)
{
    data_p[idx] = val;
}

// faster because we can store directly
void
datasets(int idx, char val)
{
    data_s[idx] = val;
}
Now, if we consider caching, datasetp and datasets will be about the same [after the first access], because the fetch of data_p will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.
However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p at the start of the loop and put it in a register:
void
datasetalls(char val)
{
    int idx;

    for (idx = 0; idx < 65536; ++idx)
        data_s[idx] = val;
}

void
datasetallp(char val)
{
    int idx;

    for (idx = 0; idx < 65536; ++idx)
        data_p[idx] = val;
}

void
datasetallp_optimized(char val)
{
    int idx;
    register char *reg;

    // the optimizer will generate the equivalent code to this
    reg = data_p;
    for (idx = 0; idx < 65536; ++idx)
        reg[idx] = val;
}
If the access is so sporadic that data_p gets evicted from the cache, then the performance difference doesn't matter so much, because access to [either] array is infrequent. Thus, it is not a target for code tuning.
If such eviction occurs, the actual data array will, most likely, be evicted as well.
A much larger array might have more of an effect (e.g. if instead of 65536 we had 100000000, mere traversal would evict data_p, and by the time we reached the end of the array, the leftmost entries would already have been evicted).
But, in that case, the extra fetch of data_p would be 0.000001% overhead.
So, it helps to either benchmark [or model] the particular use case/access pattern.
UPDATE:
Based on some further experimentation [triggered by a comment by Peter], the datasetallp function does not optimize to the equivalent of datasetallp_optimized for certain conditions, due to strict aliasing considerations.
Because the arrays are char [or unsigned char], the compiler generates a data_p fetch on each loop iteration. Note that if the arrays are not char (e.g. int), the optimization does occur and data_p is fetched only once, because char can alias anything but int is more limited.
If we change char *data_p to char *restrict data_p we get the optimized behavior. Adding restrict tells the compiler that data_p will not alias anything [even itself], so it's safe to optimize the fetch.
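A minimal sketch of that change (this replaces the plain char *data_p declaration above; with the qualifier, the generated loop matches datasetallp_optimized):
char *restrict data_p;              // promise: data_p does not alias anything

void
datasetallp_restricted(char val)
{
    int idx;

    // the compiler may now load data_p once, before the loop,
    // instead of re-fetching it on every iteration
    for (idx = 0; idx < 65536; ++idx)
        data_p[idx] = val;
}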
Personal note: While I understand this, to me, it seems goofy that without restrict, the compiler must assume that data_p can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p pointing to itself or that data_p is part of a struct:
// simplest
char *data_p = malloc(65536);
data_p = (void *) &data_p;

// closer to real world
struct mystruct {
    ...
    char *data_p;
    ...
};
struct mystruct mystruct;
mystruct.data_p = (void *) &mystruct;
These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should allow fetch optimization].
For myself, I always hand code the equivalent of datasetallp_optimized [sans register], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p only once".
Also, the multifetch problem does not occur when using data_p for input [because we're not modifying anything, aliasing is not a consideration]:
// does fetch of data_p once at loop start
int
datasumallp(void)
{
    int idx;
    int sum;

    sum = 0;
    for (idx = 0; idx < 65536; ++idx)
        sum += data_p[idx];

    return sum;
}
But, while it can be fairly common, "hardwiring" a primitive array manipulation function with an explicit array [either data_s or data_p] is often less useful than passing the array address as an argument.
Side note: clang would optimize some of the functions using data_s into memset calls, so, during experimentation, I modified the example code slightly to prevent this.
// array_t is the element type used during the experiments (the typedef itself
// isn't shown here; e.g. it could be unsigned char or int)
void
dataincallx(array_t *data, int val)
{
    int idx;

    for (idx = 0; idx < 65536; ++idx)
        data[idx] = val + idx;
}
This does not suffer from the multifetch problem. That is, dataincallx(data_s,17) and dataincallx(data_p,37) work about the same [with the initial extra data_p fetch]. This is more likely to be what one might use in general [better code reuse, etc].
So, the distinction between data_s and data_p becomes a bit more of a moot point. Coupled with judicious use of restrict [or using types other than char], the data_p fetch overhead can be minimized to the point where it isn't really that much of a consideration.
It now comes down more to architectural/design choices of choosing a fixed size array or dynamically allocating one. Others have pointed out the tradeoffs.
This is use case dependent.
If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.
However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better that each function references the global [either data_s or data_p] directly.

Calculating offsets is not a big performance issue. You have to consider how you will actually use the array in your code. You'll most likely write something like data[i] = x; and then no matter where data is stored, the program has to load a base address and calculate an offset.
The scenario where the compiler can hard code the address in case of the statically allocated array only happens when you write something like data[55] = x; which is probably a far less likely use case.
At any rate we are talking about a few CPU ticks here and there. It's not something you should go chasing by attempting manual optimization.
Each memory access is equivalent to about 40 CPU clock cycles
What!? What CPU is that? Some pre-ancient computer from 1960?
Regarding cache memory, those concerns may be more valid. It is possible that statically allocated memory utilizes data cache better, but that's just speculation and you'd have to have a very specific CPU in mind to have that discussion.
There is however a significant performance difference between static and dynamic allocation, and that is the allocation itself. For each call to malloc there is a call into the allocator (and sometimes the OS), which runs a search through the heap looking for a free segment. The library also needs to keep track of the address of that segment internally, so that when you call free() it knows how much memory to release. Also, the more you call malloc/free, the more fragmented the heap becomes.
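One common way to sidestep that per-call cost, if the required size is known up front, is to allocate once at startup and reuse the block. A rough sketch (data_init and data_store are made-up names):
#include <stdlib.h>

#define DATA_SIZE 65536

static char *data;

/* hypothetical init: pay the malloc cost (and any OS call) once, at startup */
int
data_init(void)
{
    data = malloc(DATA_SIZE);
    return data != NULL;
}

/* later accesses touch only the already-allocated block, so there is
   no further malloc/free traffic and no added heap fragmentation */
void
data_store(int idx, char val)
{
    data[idx] = val;
}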

I think that data locality is much more of an issue than computing the base address of the array. (I could imagine cases where accessing the pointer contents is extremely fast because it is in a register while the offset to the stack pointer or text segment is a compile time constant; accessing a register may be faster.)
But the real issue will be data locality, which is often a reason to be careful with dynamic memory in performance critical tight loops. If you have more dynamically allocated data which happens to be close to your array, chances are the memory will remain in the cache. If you have data scattered all over the RAM allocated at different times, you may have many cache misses accessing them. In that case it would be better to allocate them statically (or on the stack) next to each other, if possible.
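For instance, one way to keep related dynamic data close together is to carve several arrays out of a single allocation. A rough sketch (the names and layout are made up for illustration):
#include <stdlib.h>

/* three arrays that are traversed together in a hot loop are cut from one
   malloc'd block, so they end up adjacent in memory instead of scattered
   across the heap */
typedef struct {
    float *pos;
    float *vel;
    float *mass;
} Particles;

int
particles_init(Particles *p, size_t n)
{
    float *block = malloc(3 * n * sizeof *block);
    if (block == NULL)
        return 0;
    p->pos  = block;
    p->vel  = block + n;
    p->mass = block + 2 * n;
    return 1;
}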

There is a small effect here. It's unlikely to be significant, but it is real. It will often take one extra instruction to resolve the extra level of indirection for a global pointer-to-a-buffer instead of a global array. For most uses, other considerations will be more important (like reuse of the same scratch space, vs giving each function its own scratch buffer). Also: avoiding compile-time size limits!
This effect is only present when you reference the global directly, rather than passing around the address as a function parameter. Inlining / whole-program link-time optimization may see all the way back to where the global is used as a function arg initially, and be able to take advantage of it throughout more code, though.
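To illustrate, a small sketch using the data_s / data_p globals defined just below: when the address arrives as a function argument, the static and dynamic versions share the same body, and any extra load happens once at the call site:
/* both kinds of caller funnel through the same body; inside the function
   there is no difference between the static and the dynamic buffer */
void store_arg(int *arr, int val) { arr[val] = val; }

/* hypothetical call sites:
     store_arg(data_s, x);   // address is a link-time constant at the call
     store_arg(data_p, x);   // address is loaded from the global pointer  */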
Let's compare simple test functions, compiled by clang 3.7 for x86-64 (SystemV ABI, so the first arg is in rdi). Results on other architectures will be essentially the same:
int data_s[65536];
int *data_p;
void store_s(int val) { data_s[val] = val; }
movsxd rax, edi ; sign-extend
mov dword ptr [4*rax + data_s], eax
ret
void store_p(int val) { data_p[val] = val; }
movsxd rax, edi
mov rcx, qword ptr [rip + data_p] ; the extra level of indirection
mov dword ptr [rcx + 4*rax], eax
ret
Ok, so there's overhead of one extra load. (mov r64, [rel data_p]). Depending on what other static/global objects data_p is stored near, it may tend to stay hot in cache even if we're not using it often. If it's in a cache line with no other frequently-accessed data, it's wasting most of that cache line, though.
The overhead is only paid once per function call, though, even if there's a loop. (Unless the array is an array of pointers, since C aliasing rules require the compiler to assume that any pointer might be pointing to data_p, unless it can prove otherwise. This is the main performance danger when using global pointers-to-pointers.)
If you don't use restrict, the compiler still has to assume that the buffers pointed to by int *data_p1 and int *data_p2 could overlap, though, which interferes with autovectorization, loop unrolling, and many other optimizations. Static buffers can't overlap with other static buffers, but restrict is still needed when using a static buffer and a pointer in the same loop.
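A sketch of the kind of signature that gives the optimizer that guarantee (a hypothetical helper, not one of the test functions above):
/* with restrict the compiler may assume dst and src never overlap,
   so it can unroll and auto-vectorize the loop */
void add_arrays(int *restrict dst, const int *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += src[i];
}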
Anyway, let's have a look at the code for very simple memset-style loops:
void loop_s(int val) { for (int i=0; i<65536; ++i) data_s[i] = val; }
mov rax, -262144 ; loop counter, counting up towards zero
.LBB3_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rax + data_s+262144], edi
add rax, 4
jne .LBB3_1
ret
Note that clang uses a non-RIP-relative effective address for data_s here, because it can.
void loop_p(int val) { for (int i=0; i<65536; ++i) data_p[i] = val; }
mov rax, qword ptr [rip + data_p]
xor ecx, ecx
.LBB4_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rax + 4*rcx], edi
add rcx, 1
cmp rcx, 65536
jne .LBB4_1
ret
Note the mov rax, qword [rip + data_p] in loop_p, and the less efficient loop structure because it uses the loop counter as an array index.
gcc 5.3 has much less difference between the two loops: it gets the start address into a register and increments it, with a compare against the end address. So it uses a one-register addressing mode for the store, which may perform better on Intel CPUs. The only difference in loop structure / overhead for gcc is that the static buffer version gets the initial pointer into a register with a mov r32, imm32, rather than a load from memory. (So the address is an immediate constant embedded in the instruction stream.)
In shared-library code, and on OS X, where all executables must be position-independent, gcc's way is the only option. Instead of mov r32, imm32 to get the address into a register, it would use lea r64, [RIP + displacement]. The opportunity to save an instruction by embedding the absolute address into other instructions is gone when you need to offset the address by a variable amount (e.g. array index). If the array index is a compile-time constant, it can be folded into the displacement for a RIP-relative load or store instead of LEA. For a loop over an array, this could only happen with full unrolling, and is thus unlikely.
Either way, the extra level of indirection is still there with a pointer to dynamically allocated memory. There's still a chance of a cache or TLB miss when doing a load instead of an LEA.
Note that global data (as opposed to static) has an extra level of indirection through the global offset table, but that's on top of the indirection or lack thereof from dynamic allocation. Compiled with gcc 5.3 -fPIC:
int global_data_s[65536];
int access_global_s(int i){return global_data_s[i];}
mov rax, QWORD PTR global_data_s@GOTPCREL[rip] ; load from a RIP-relative address, instead of an LEA
movsx rdi, edi
mov eax, DWORD PTR [rax+rdi*4] ; load, indexing the array
ret
int *global_data_p;
int access_global_p(int i){return global_data_p[i];}
mov rax, QWORD PTR global_data_p@GOTPCREL[rip] ; extra layer of indirection through the GOT
movsx rdi, edi
mov rax, QWORD PTR [rax] ; load the pointer (the usual layer of indirection)
mov eax, DWORD PTR [rax+rdi*4] ; load, indexing the array
ret
If I understand this correctly, the compiler doesn't assume that the symbol definition for global symbols in the current compilation unit are the definitions that will actually be used at link time. So the RIP-relative offset isn't a compile-time constant. Thanks to runtime dynamic linking, it's not a link-time constant either, so an extra level of indirection through the GOT is used. This is unfortunate, and I hope compilers on OS X don't have this much overhead for global variables. With -O0 -fwhole-program on godbolt, I can see that even the globals are accessed with just RIP-relative addressing, not through the GOT, so perhaps that option is even more valuable than usual when making position-independent executables. Hopefully link-time-optimization works too, because that could be used when making shared libraries.
As many other answers have pointed out, there are other important factors, like memory locality, and the overhead of actually doing the allocate/free. These don't matter much for a large buffer (multiple pages) that's allocated once at program startup. See the comments on Peter A. Schneider's answer.
Dynamic allocation can give significant benefits, though, if you end up using the same memory as scratch space for multiple different things, so it stays hot in cache. Giving each function its own static buffer for scratch space is often a bad move if they aren't needed simultaneously: the dirty memory has to get written back to main memory when it's no longer needed, and is part of the program's footprint forever.
A good way to get small scratch buffers without the overhead of malloc (or new) is to create them on the stack (e.g. as local array variables). C99 allows variable-sized local arrays, like foo(int n) { int buf[n]; ...; } Be careful not to overdo it and run out of stack space, but the current stack page is going to be hot in the TLB. The _local functions in my godbolt links allocate a variable-sized array on the stack, which has some overhead for re-aligning the stack to a 16B boundary after adding a variable size. It looks like clang takes care to mask off the sign bit, but gcc's output looks like it will just break in fun and exciting ways if n is negative. (In godbolt, use the "binary" button to get disassembler output, instead of the compiler's asm output, because the disassembly uses hex for immediate constants. e.g. clang's movabs rcx, 34359738352 is 0x7fffffff0). Even though it takes a few instructions, it's much cheaper than malloc. A medium to large allocation with malloc, like 64kiB, will typically make an mmap system call. But this is the cost of allocation, not the cost of accessing once allocated.
Having the buffer on the stack means the stack pointer itself is the base address for indexing into it. This means it doesn't take an extra register to hold that pointer, and it doesn't have to be loaded from anywhere.
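A minimal sketch of that pattern (a hypothetical function; the VLA form needs C99 or later and assumes n > 0 and reasonably small):
#include <stddef.h>

/* caller-sized scratch space on the stack: no malloc/free, and the memory
   is likely in pages that are already mapped and hot in the TLB */
int checksum(const unsigned char *msg, size_t n)
{
    unsigned char buf[n];              /* C99 VLA -- keep n small! */
    int sum = 0;

    for (size_t i = 0; i < n; i++) {
        buf[i] = msg[i] ^ 0x5a;        /* some made-up scratch work */
        sum += buf[i];
    }
    return sum;
}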
If any elements are statically initialized to non-zero in a static (or global), the entire array or struct will be sitting there in the executable, which is a big waste of space if most entries should be zero at program startup. (Or if the data can be computed on the fly quickly.)
On some systems, having a gigantic zero-initialized static array doesn't cost you anything as long as you never even read the parts you don't need. Lazy mapping of memory means the OS maps all the pages of your giant array to the same page of zeroed memory, and does copy-on-write. Taking advantage of this would be an ugly performance hack to be used only if you were sure you really wanted it, and were sure your code would never run on a system where your char data[1<<30] would actually use that much memory right away.
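A sketch of the hack being described (whether it stays cheap depends entirely on the OS doing lazy, copy-on-write mapping of zero pages):
/* 1 GiB of zero-initialized static storage: it lives in .bss, so the
   executable does not grow, and on an OS with lazy zero-page mapping only
   the pages you actually write ever get their own physical memory */
static char data[1 << 30];

int
use_a_little(void)
{
    for (int i = 0; i < 4096; i++)     /* touch only the first page */
        data[i] = (char)i;
    return data[100];
}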
Each memory access is equivalent to about 40 CPU clock cycles.
This is nonsense. The latency can be anywhere from 3 or 4 cycles (L1 cache hit) to multiple hundreds of cycles (main memory), or even a page fault requiring a disk access. Other than a page fault, much of this latency can overlap with other work, so the impact on throughput can be much lower. A load from a constant address can start as soon as the instruction issues into the out-of-order core, since it's the start of a new dependency chain. The size of the out-of-order window is limited (an Intel Skylake core has a Re-Order Buffer of 224 uops, and can have 72 loads in flight at once). A full cache miss (or worse, a TLB miss followed by a cache miss) often does stall out-of-order execution. See http://agner.org/optimize/, and other links in the x86 wiki. Also see this blog post about the impact of ROB size on how many cache misses can be in flight at once.
Out-of-order cores for other architectures (like ARM and PPC) are similar, but in-order cores suffer more from cache misses because they can't do anything else while waiting. (Big x86 cores like Intel and AMD's mainstream microarchitectures (not the low-power Silvermont or Jaguar microarchitectures) have more out-of-order execution resources than other designs, though. AFAIK, most ARM cores have significantly smaller buffers for starting independent loads early and/or hiding cache-miss latency.)

I would say you really should profile it. Theoretically you are right but there are some basic things you have to remember.
C is a high-level language, like many that exist today, in which you tell the machine what to do. Getting closer to machine code would mean writing ASM or similar. When you build code, through compiling and linking, the compiler will try its best to correctly run what you ask for and to optimize it (unless you tell it not to). Remember, there are also concepts like Just-In-Time compilation (JIT).
So I consider it hard to answer your question. Of one thing you can be sure: a static array will most likely be faster, especially at a size of 65536, because there are more opportunities for the compiler to optimize. This may depend on the size you define. For GCC, 65536 bytes seems to be a common size for stacks and caches. Some compilers might even tell you the array is too big, because they try to keep it in levels of the memory hierarchy, such as caches, that are faster than RAM.
Last but not least, remember that modern operating systems also do their memory management using virtual memory.
Static memory is stored in data segments and will most likely be loaded when the program is executed, but remember that this load time is also something to consider. Should the OS allocate the memory when the program starts, or should you do it at runtime? It really depends on your application.
So I think you really should benchmark your results and see by how much faster it is. But as a tendency, I would say the static array will compile to code that runs faster.

Related

Using the callstack to implement a stack data structure in C?

My understanding of the memory structure in C is that a program's memory is split between the stack and the heap, each growing from opposite ends of the address space, conceivably able to allocate all of RAM, but obviously abstracted by some kind of OS memory-fragment manager.
The stack is designed for handling local variables (automatic storage) and the heap for memory allocation (dynamic storage).
(Editor's note: there are C implementations where automatic storage doesn't use a "call stack", but this question assumes a normal modern C implementation on a normal CPU where locals do use the callstack if they can't just live in registers.)
Say I want to implement a stack data structure for some data parsing algorithm. Its lifetime and scope is limited to one function.
I can think of 3 ways to do such a thing, yet none of them seem to me as the cleanest way to go about this given the scenario.
My first thought is to construct a stack in the heap, like C++ std::vector:
Some algorithm(Some data)
{
    Label *stack = new_stack(stack_size_estimate(data));
    Iterator i = some_iterator(data);
    while (i)
    {
        Label label = some_label(some_iterator_at(i));
        if (label_type_a(label))
        {
            push_stack(stack, label);
        }
        else if (label_type_b(label))
        {
            some_process(&data, label, pop_stack(stack));
        }
        i = some_iterator_next(i);
    }
    some_stack_cleanup(&data, stack);
    delete_stack(stack);
    return data;
}
This method is alright but it's wasteful in that the stack size is a guess and at any moment push_stack could call some internal malloc or realloc and cause irregular slowdowns. None of which are problems for this algorithm, but this construct seems better suited for applications in which a stack has to be maintained across multiple contexts. That isn't the case here; the stack is private to this function and is deleted before exit, just like automatic storage class.
My next thought is recursion. Because recursion uses the builtin stack this seems closer to what I want.
Some algorithm(Some data)
{
    Iterator i = some_iterator(data);
    return some_extra(algorithm_helper(extra_from_some(data), &i));
}

Extra algorithm_helper(Extra thing, Iterator *i)
{
    if (!*i)
        { return thing; }
    {
        Label label = some_label(some_iterator_at(i));
        if (label_type_a(label))
        {
            *i = some_iterator_next(*i);
            return algorithm_helper
                ( extra_process( algorithm_helper(thing, i), label), i );
        }
        else if (label_type_b(label))
        {
            *i = some_iterator_next(*i);
            return extra_attach(thing, label);
        }
    }
}
This method saves me from writing and maintaining a stack. The code, to me, seems harder to follow, not that it matters to me.
My main issue with it is this is using way more space.
With stack frames holding copies of this Extra construct (which basically contains the Some data plus the actual bits wanted to be held in the stack) and unnecessary copies of the exact same iterator pointer in every frame: because it's "safer" than referencing some static global (and I don't know how to not do it this way). This wouldn't be a problem if the compiler did some clever tail-recursion-like thing, but I don't know if I like crossing my fingers and hoping my compiler is awesome.
The third way I can think of involves some kind of dynamic array thing that can grow on the stack being the last thing there written using some kind of C thing I don't know about.
Or an extern asm block.
Thinking about this, that's what I'm looking for, but I don't see myself writing an asm version unless it's dead simple, and I don't see that being easier to write or maintain despite it seeming simpler in my head. And obviously it wouldn't be portable across ISAs.
I don't know if I'm overlooking some feature or if I need to find another language or if I should rethink my life choices. All could be true... I hope it's just the first one.
I'm not opposed to using some library. Is there one, and if so how does it work? I didn't find anything in my searches.
I recently learned about Variable Length Arrays and I don't really understand why they couldn't be leveraged as a way to grow stack reference, but also I can't imagine them working that way.
tl; dr: use std::vector or an equivalent.
(Edited)
Regarding your opening statement: The days of segments are over. These days processes have multiple stacks (one for each thread), but all share one heap.
Regarding option 1: Instead of writing and maintaining a stack, and guessing its size, you should just literally use std::vector, or a C wrapper around it, or a C clone of it - in any case, use the 'vector' data structure.
Vector's algorithm is generally quite efficient. Not perfect, but generally good for many, many real-world use cases.
Regarding option 2: You are right, at least as long as the discussion is confined to C. In C, recursion is both wasteful and non-scalable. In some other languages, notably in functional languages, recursion is the way to express these algorithms, and tail-call optimization is part of the language definition.
Regarding option 3: The closest to that C thing you're looking for is alloca(). It allows you to grow the stack frame, and if the stack doesn't have enough memory, the OS will allocate it. However, it's going to be quite difficult to build a stack object around it, since there's no realloca(), as pointed out by @Peter Cordes.
The other drawback is that stacks are still limited. On Linux, the stack is typically limited to 8 MB. This is the same scalability limitation as with recursion.
Regarding variable length arrays: VLAs are basically syntactic sugar, a notational convenience. Beyond syntax, they have the exact same capabilities as arrays (actually, even fewer, viz. sizeof() doesn't work), let alone the dynamic power of std::vector.
In practice, if you can't set a hard upper bound on possible size of less than 1kiB or so, you should normally just dynamically allocate. If you can be sure the size is that small, you could consider using alloca as the container for your stack.
(You can't conditionally use a VLA, it has to be in-scope. Although you could have its size be zero by declaring it after an if(), and set a pointer variable to the VLA address, or to malloc. But alloca would be easier.)
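A rough sketch of that pattern, using a dummy size of 1 rather than 0 (a zero-sized VLA isn't valid C) and assuming n >= 1; error handling is omitted:
#include <stdlib.h>

/* small sizes live on the stack, large ones come from malloc; the VLA itself
   is declared unconditionally so it stays in scope for the rest of the body */
void process(size_t n)
{
    int use_stack = (n <= 1024);
    long vla[use_stack ? n : 1];
    long *stack = use_stack ? vla : malloc(n * sizeof *stack);

    /* ... push/pop through `stack` here ... */

    if (!use_stack)
        free(stack);
}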
In C++ you'd normally use std::vector, but it's dumb because it can't / doesn't use realloc (Does std::vector *have* to move objects when growing capacity? Or, can allocators "reallocate"?). So in C++ it's a tradeoff between more efficient growth vs. reinventing the wheel, although it's still amortized O(1) time. You can mitigate most of it with a fairly large reserve() up front, because memory you alloc but never touch usually doesn't cost anything.
In C you have to write your own stack anyway, and realloc is available. (And all C types are trivially copyable, so there's nothing stopping you from using realloc). So when you do need to grow, you can realloc the storage. But if you can't set a reasonable and definitely-large-enough upper bound on function entry and might need to grow, then you should still track capacity vs. in-use size separately, like std::vector. Don't call realloc on every push/pop.
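A minimal sketch of such a stack (made-up names; initialize with Stack s = {0}; and check the return value of push):
#include <stdlib.h>

typedef struct {
    size_t *items;
    size_t  len;        /* elements in use    */
    size_t  cap;        /* elements allocated */
} Stack;

int stack_push(Stack *s, size_t v)
{
    if (s->len == s->cap) {                       /* grow only when full */
        size_t ncap = s->cap ? 2 * s->cap : 64;
        size_t *p = realloc(s->items, ncap * sizeof *p);
        if (p == NULL)
            return 0;
        s->items = p;
        s->cap = ncap;
    }
    s->items[s->len++] = v;
    return 1;
}

size_t stack_pop(Stack *s)
{
    return s->items[--s->len];                    /* caller ensures len > 0 */
}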
Using the callstack directly as a stack data structure is easy in pure assembly language (for ISAs and ABIs that use a callstack i.e. "normal" CPUs like x86, ARM, MIPS, etc). And yes, in asm worth doing for stack data structures that you know will be very small, and not worth the overhead of malloc / free.
Use asm push or pop instructions (or equivalent sequence for ISAs without a single-instruction push / pop). You can even check the size / see if the stack data structure is empty by comparing against a saved stack-pointer value. (Or just maintain an integer counter along side your push/pops).
A very simple example is the inefficient way some people write int->string functions. For non-power-of-2 bases like 10, you generate digits in least-significant first order by dividing by 10 to remove them one at a time, with digit = remainder. You could just store into a buffer and decrement a pointer, but some people write functions that push in the divide loop and then pop in a second loop to get them in printing order (most-significant first). e.g. Ira's answer on How do I print an integer in Assembly Level Programming without printf from the c library? (My answer on the same question shows the efficient way which is also simpler once you grok it.)
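As a sketch of the "store into a buffer and decrement a pointer" approach, in C rather than asm (a hypothetical helper, not taken from either linked answer):
/* buf_end points one past the end of a buffer with room for the digits and a
   terminating '\0' (21 bytes covers any 64-bit value); returns a pointer to
   the first character of the finished string */
char *u64_to_str(unsigned long long n, char *buf_end)
{
    char *p = buf_end;
    *--p = '\0';
    do {
        *--p = (char)('0' + n % 10);   /* digits come out least-significant
                                          first, but are stored right-to-left,
                                          so the string reads correctly */
        n /= 10;
    } while (n);
    return p;
}
Called as, for example: char buf[21]; char *s = u64_to_str(x, buf + sizeof buf);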
It doesn't particularly matter that the stack grows towards the heap, just that there is some space you can use. And that stack memory is already mapped, and normally hot in cache. This is why we might want to use it.
Stack above heap happens to be true under GNU/Linux for example, which normally puts the main thread's user-space stack near the top of user-space virtual address space. (e.g. 0x7fff...) Normally there's a stack-growth limit that's much smaller than the distance from stack to heap. You want an accidental infinite recursion to fault early, like after consuming 8MiB of stack space, not drive the system to swapping as it uses gigabytes of stack. Depending on the OS, you can increase the stack limit, e.g. ulimit -s. And thread stacks are normally allocated with mmap, the same as other dynamic allocation, so there's no telling where they'll be relative to other dynamic allocation.
AFAIK it's impossible from C, even with inline asm
(Not safely, anyway. An example below shows just how evil you'd have to get to write this in C the way you would in asm. It basically proves that modern C is not a portable assembly language.)
You can't just wrap push and pop in GNU C inline asm statements because there's no way to tell the compiler you're modifying the stack pointer. It might try to reference other local variables relative to the stack pointer after your inline asm statement changed it.
Possibly if you knew you could safely force the compiler to create a frame pointer for that function (which it would use for all local-variable access) you could get away with modifying the stack pointer. But if you want to make function calls, many modern ABIs require the stack pointer to be over-aligned before a call. e.g. x86-64 System V requires 16-byte stack alignment before a call, but push/pop work in units of 8 bytes. OTOH, 32-bit ARM (and some 32-bit x86 calling conventions, e.g. Windows) don't have that feature so any number of 4-byte pushes would leave the stack correctly aligned for a function call.
I wouldn't recommend it, though; if you want that level of optimization (and you know how to optimize asm for the target CPU), it's probably safer to write your whole function in asm.
Variable Length Arrays and I don't really understand why they couldn't be leveraged as a way to grow stack reference
VLAs aren't resizable. After you do int VLA[n]; you're stuck with that size. Nothing you can do in C will guarantee you more memory that's contiguous with that array.
Same problem with alloca(size). It's a special compiler built-in function that (on a "normal" implementation) decrements the stack pointer by size bytes (rounded to a multiple of the stack width) and returns that pointer. In practice you can make multiple alloca calls and they will very likely be contiguous, but there's zero guarantee of that so you can't use it safely without UB. Still, you might get away with this on some implementations, at least for now until future optimizations notice the UB and assume that your code can't be reachable.
(And it could break on some calling conventions like x86-64 System V where VLAs are guaranteed to be 16-byte aligned. An 8-byte alloca there probably rounds up to 16.)
But if you did want to make this work, you'd maybe use long *base_of_stack = alloca(sizeof(long)); (the highest address: stacks grow downward on most but not all ISAs / ABIs - this is another assumption you'd have to make).
Another problem is that there's no way to free alloca memory except by leaving the function scope. So your pop has to increment some top_of_stack C pointer variable, not actually moving the real architectural "stack pointer" register. And push will have to see whether the top_of_stack is above or below the high-water mark which you also maintain separately. If so you alloca some more memory.
At that point you might as well alloca in chunks larger than sizeof(long) so the normal case is that you don't need to alloc more memory, just move the C variable top-of-stack pointer. e.g. chunks of 128 bytes maybe. This also solves the problem of some ABIs keeping the stack pointer over-aligned. And it lets the stack elements be narrower than the push/pop width without wasting space on padding.
It does mean we end up needing more registers to sort of duplicate the architectural stack pointer (except that the SP never increases on pop).
Notice that this is like std::vector's push_back logic, where you have an allocation size and an in-use size. The difference is that std::vector always copies when it wants more space (because implementations fail to even try to realloc) so it has to amortize that by growing exponentially. When we know growth is O(1) by just moving the stack pointer, we can use a fixed increment. Like 128 bytes, or maybe half a page would make more sense. We're not touching memory at the bottom of the allocation immediately; I haven't tried compiling this for a target where stack probes are needed to make sure you don't move RSP by more than 1 page without touching intervening pages. MSVC might insert stack probes for this.
Hacked up alloca stack-on-the-callstack: full of UB and miscompiles in practice with gcc/clang
This mostly exists to show how evil it is, and that C is not a portable assembly language. There are things you can do in asm you can't do in C. (Also including efficiently returning multiple values from a function, in different registers, instead of a stupid struct.)
#include <alloca.h>
#include <stdlib.h>

void some_func(char);

// assumptions:
// stack grows down
// alloca is contiguous
// all the UB manages to work like portable assembly language.

// input assumptions: no mismatched { and }
// made up useless algorithm: if('}') total += distance to matching '{'
size_t brace_distance(const char *data)
{
    size_t total_distance = 0;
    volatile unsigned hidden_from_optimizer = 1;
    void *stack_base = alloca(hidden_from_optimizer);  // highest address. top == this means empty
    // alloca(1) would probably be optimized to just another local var,
    // not necessarily at the bottom of the stack frame. Like char foo[1]

    static const int growth_chunk = 128;
    size_t *stack_top = stack_base;
    size_t *high_water = alloca(growth_chunk);

    for (size_t pos = 0; data[pos] != '\0'; pos++) {
        some_func(data[pos]);

        if (data[pos] == '{') {
            //push_stack(stack, pos);
            stack_top--;
            if (stack_top < high_water)   // UB: optimized away by clang; never allocs more space
                high_water = alloca(growth_chunk);
            // assert(high_water < stack_top && "stack growth happened somewhere else");
            *stack_top = pos;
        }
        else if (data[pos] == '}')
        {
            //total_distance += pop_stack(stack);
            size_t popped = *stack_top;
            stack_top++;
            total_distance += pos - popped;
            // assert(stack_top <= stack_base)
        }
    }

    return total_distance;
}
Amazingly, this seems to actually compile to asm that looks correct (on Godbolt), with gcc -O1 for x86-64 (but not at higher optimization levels). clang -O1 and gcc -O3 optimize away the if(top<high_water) alloca(128) pointer compare so this is unusable in practice.
< pointer comparison of pointers derived from different objects is UB, and it seems even casting to uintptr_t doesn't make it safe. Or maybe GCC is just optimizing away the alloca(128) based on the fact that high_water = alloca() is never dereferenced.
https://godbolt.org/z/ZHULrK shows gcc -O3 output where there's no alloca inside the loop. Fun fact: making growth_chunk a volatile int to hide the constant value from the optimizer makes it not get optimized away. So I'm not sure it's pointer-compare UB that's causing the issue; it's more like accessing memory below the first alloca, instead of dereferencing a pointer derived from the second alloca, that gets compilers to optimize it away.
# gcc9.2 -O1 -Wall -Wextra
# note that -O1 doesn't include some loop and peephole optimizations, e.g. no xor-zeroing
# but it's still readable, not like -O1 spilling every var to the stack between statements.
brace_distance:
push rbp
mov rbp, rsp # make a stack frame
push r15
push r14
push r13 # save some call-preserved regs for locals
push r12 # that will survive across the function call
push rbx
sub rsp, 24
mov r12, rdi
mov DWORD PTR [rbp-52], 1
mov eax, DWORD PTR [rbp-52]
mov eax, eax
add rax, 23
shr rax, 4
sal rax, 4 # some insane alloca rounding? Why not AND?
sub rsp, rax # alloca(1) moves the stack pointer, RSP, by whatever it rounded up to
lea r13, [rsp+15]
and r13, -16 # stack_base = 16-byte aligned pointer into that allocation.
sub rsp, 144 # alloca(128) reserves 144 bytes? Ok.
lea r14, [rsp+15]
and r14, -16 # and the actual C allocation rounds to %16
movzx edi, BYTE PTR [rdi] # data[0] check before first iteration
test dil, dil
je .L7 # if (empty string) goto return 0
mov ebx, 0 # pos = 0
mov r15d, 0 # total_distance = 0
jmp .L6
.L10:
lea rax, [r13-8] # tmp_top = top-1
cmp rax, r14
jnb .L4 # if(tmp_top < high_water)
sub rsp, 144
lea r14, [rsp+15]
and r14, -16 # high_water = alloca(128) if body
.L4:
mov QWORD PTR [r13-8], rbx # push(pos) - the actual store
mov r13, rax # top = tmp_top completes the --top
# yes this is clunky, hopefully with more optimization gcc would have just done
# sub r13, 8 and used [r13] instead of this RAX tmp
.L5:
add rbx, 1 # loop condition stuff
movzx edi, BYTE PTR [r12+rbx]
test dil, dil
je .L1
.L6: # top of loop body proper, with 8-bit DIL = the non-zero character
movsx edi, dil # unofficial part of the calling convention: sign-extend narrow args
call some_func # some_func(data[pos]
movzx eax, BYTE PTR [r12+rbx] # load data[pos]
cmp al, 123 # compare against braces
je .L10
cmp al, 125
jne .L5 # goto loop condition check if nothing special
# else: it was a '}'
mov rax, QWORD PTR [r13+0]
add r13, 8 # stack_top++ (8 bytes)
add r15, rbx # total += pos
sub r15, rax # total -= popped value
jmp .L5 # goto loop condition.
.L7:
mov r15d, 0
.L1:
mov rax, r15 # return total_distance
lea rsp, [rbp-40] # restore stack pointer to point at saved regs
pop rbx # standard epilogue
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
This is like you'd do for a dynamically allocated stack data structure except:
it grows downward like the callstack
we get more memory from alloca instead of realloc. (realloc can also be efficient if there's free virtual address space after the allocation). C++ chose not to provide a realloc interface for their allocator, so std::vector always stupidly allocs + copies when more memory is required. (AFAIK no implementations optimize for the case where new hasn't been overridden and use a private realloc).
it's totally unsafe and full of UB, and fails in practice with modern optimizing compilers
the pages will never get returned to the OS: if you use a large amount of stack space, those pages stay dirty indefinitely.
If you can choose a size that's definitely large enough, you could use a VLA of that size.
I'd recommend starting at the top and going downward, to avoid touching memory far below the currently in-use region of the callstack. That way, on an OS that doesn't need "stack probes" to grow the stack by more than 1 page, you might avoid ever touching memory far below the stack pointer. So the small amount of memory you do end up using in practice might all be within an already mapped page of the callstack, and maybe even cache lines that were already hot if some recent deeper function call already used them.
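A sketch of that, reworking the brace_distance example above with a plain fixed-size buffer (dropping the some_func call used earlier to defeat the optimizer, and assuming matched braces with a nesting depth of at most MAXDEPTH):
#define MAXDEPTH 1024                    /* assumed safe upper bound */

size_t brace_distance_fixed(const char *data)
{
    size_t stack[MAXDEPTH];              /* plain local array (or a VLA)   */
    size_t *top = stack + MAXDEPTH;      /* empty: top is one past the end */
    size_t total = 0;

    for (size_t pos = 0; data[pos] != '\0'; pos++) {
        if (data[pos] == '{')
            *--top = pos;                /* push: grows downward, staying near
                                            already-touched stack memory */
        else if (data[pos] == '}')
            total += pos - *top++;       /* pop */
    }
    return total;
}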
If you do use the heap, you can minimize realloc costs by doing a pretty large allocation. Unless there was a block on the free-list that you could have gotten with a smaller allocation, in general over-allocating has very low cost if you never touch the parts you didn't need, especially if you free or shrink it before doing any more allocations.
i.e. don't memset it to anything. If you want zeroed memory, use calloc which may be able to get zeroed pages from the OS for you.
Modern OSes use lazy virtual memory for allocations so the first time you touch a page, it typically has to page-fault and actually get wired into the HW page tables. Also a page of physical memory has to get zeroed to back this virtual page. (Unless the access was a read, then Linux will copy-on-write map the page to a shared physical page of zeros.)
A virtual page you never even touch will just be a larger size in an extent book-keeping data structure in the kernel. (And in the user-space malloc allocator). This doesn't add anything to the cost of allocating it, or to freeing it, or using the earlier pages that you do touch.
Not really a true answer, but a bit too long for a mere comment.
In fact, the image of the stack and the heap each growing towards the other is oversimplified. It used to be true with the 8086 processor series (at least in some memory models), where the stack and the heap shared one single segment of memory, but even the old Windows 3.1 system came with API functions allowing memory to be allocated outside of the heap (search for GlobalAlloc as opposed to LocalAlloc), provided the processor was at least an 80286.
But modern systems all use virtual memory. With virtual memory there is no longer a nice consecutive segment shared by the heap and the stack, and the OS can hand out memory wherever it wants (provided, of course, it can find free memory somewhere). But the stack still has to be a consecutive segment. Because of that, the size of that segment is determined at build time and is fixed, while the size of the heap is only constrained by the maximum memory the system can allocate to the process. That is the reason why many recommend using the stack only for small data structures and always dynamically allocating large ones. Furthermore, I know of no portable way for a program to learn its stack size, let alone its free stack size.
So IMHO, what is important here is to ask whether your stack usage is small enough. If it can exceed a small limit, just go for allocated memory, because it will be both easier and more robust: you can (and should) test for allocation errors, whereas a stack overflow is always fatal.
Finally, my advice is not to try to use the system stack for your own dedicated usage, even if limited to one single function, unless you can cleanly ask for a memory array on the stack and build your own stack management over it. Using assembly language to directly use the underlying stack will add a lot of complexity (to say nothing of lost portability) for a hypothetical, minimal gain. Just don't. Even if you want to use assembly language instructions for low-level optimization of your stack, my advice is to use a dedicated memory segment and leave the system stack for the compiler.
My experience says that the more complexity you put into your code, the harder it will be to maintain, and the less robust.
So just follow best practices and only use low level optimizations when and where you cannot avoid them.

CPU Cache disadvantages of using linked lists in C

I was wondering what the advantages and disadvantages of linked lists are compared to contiguous arrays in C, so I read the Wikipedia article about linked lists.
https://en.wikipedia.org/wiki/Linked_list#Disadvantages
According to this article, the disadvantages are the following:
They use more memory than arrays because of the storage used by their pointers.
Nodes in a linked list must be read in order from the beginning as linked lists are inherently sequential access.
Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards and while doubly linked lists are somewhat easier to read, memory is wasted in allocating.
Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
I understand the first 3 points but I am having a hard time with the last one:
Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
The article about CPU caches does not mention anything about non-contiguous memory arrays. As far as I know, CPU caches just cache frequently used addresses, giving something like a 10^-6 cache-miss rate.
Therefore, I do not understand why the CPU cache should be less efficient when it comes to non-contiguous memory arrays.
CPU caches actually do two things.
The one you mentioned is caching recently used memory.
The other, however, is predicting which memory is going to be used in the near future. The algorithm is usually quite simple: it assumes that the program processes a big array of data, and whenever the program accesses some memory, it prefetches a few more bytes beyond it.
This doesn't work for a linked list, as the nodes are randomly placed in memory.
Additionally, the CPU loads bigger blocks of memory (64, 128 bytes). Again, for an int64 array, a single read gives it data for processing 8 or 16 elements. For a linked list, it reads one block and the rest may be wasted, as the next node can be in a completely different chunk of memory.
And last but not least, related to the previous point: a linked list takes more memory for its management; the simplest version will take at least sizeof(pointer) additional bytes for the pointer to the next node. But it's not so much about CPU cache anymore.
The article is only scratching the surface, and gets some things wrong (or at least questionable), but the overall outcome is usually about the same: linked lists are much slower.
One thing to note is that "nodes are stored incontiguously [sic]" is an overly strong claim. It is true that in general nodes returned by, for example, malloc may be spread around in memory, especially if nodes are allocated at different times or from different threads. However, in practice, many nodes are often allocated on the same thread, at the same time, and these will often end up quite contiguous in memory, because good malloc implementations are, well, good! Furthermore, when performance is a concern, you may often use special allocators on a per-object basis, which allocate the fixed-size nodes from one or more contiguous chunks of memory, which will provide great spatial locality.
So you can assume that in at least some scenarios, linked lists will give you reasonable to good spatial locality. It largely depends on whether you are adding most or all of your list elements at once (linked lists do OK), or are constantly adding elements over a longer period of time (linked lists will have poor spatial locality).
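A rough sketch of the kind of per-object pool mentioned above (made-up names; it hands out fixed-size nodes, laid out like the Node in the find_list example below, from one contiguous chunk and never frees individual nodes):
#include <stdlib.h>

struct Node { struct Node *next; int item; };

typedef struct {
    struct Node *nodes;    /* one contiguous chunk */
    size_t       used;
    size_t       cap;
} NodePool;

int pool_init(NodePool *p, size_t cap)
{
    p->nodes = malloc(cap * sizeof *p->nodes);
    p->used = 0;
    p->cap = cap;
    return p->nodes != NULL;
}

/* nodes handed out back-to-back, so adjacent list nodes usually share
   cache lines, unlike nodes coming from scattered malloc calls */
struct Node *pool_alloc(NodePool *p)
{
    return (p->used < p->cap) ? &p->nodes[p->used++] : NULL;
}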
Now, on the side of lists being slow, one of the main issues glossed over with linked lists is the large constant factors associated with some operations relative to the array variant. Everyone knows that accessing an element given its index is O(n) in a linked list and O(1) in an array, so you don't use the linked list if you are going to do a lot of accesses by index. Similarly, everyone knows that adding an element to the middle of a list takes O(1) time in a linked list, and O(n) time in an array, so the former wins in that scenario.
What they don't address is that even operations that have the same algorithmic complexity can be much slower in practice in one implementation...
Let's take iterating over all the elements in a list (looking for a particular value, perhaps). That's an O(n) operation regardless if you use a linked or array representation. So it's a tie, right?
Not so fast! The actual performance can vary a lot! Here is what typical find() implementations would look like when compiled at -O2 optimization level in x86 gcc, thanks to godbolt which makes this easy.
Array
C Code
int find_array(int val, int *array, unsigned int size) {
    for (unsigned int i = 0; i < size; i++) {
        if (array[i] == val)
            return i;
    }
    return -1;
}
Assembly (loop only)1
.L6:
add rsi, 4
cmp DWORD PTR [rsi-4], edi
je .done
add eax, 1
cmp edx, eax
jne .L6
Linked List
C Code
typedef struct Node {
    struct Node *next;
    int item;
} Node;

Node *find_list(int val, Node *listptr) {
    while (listptr) {
        if (listptr->item == val)
            return listptr;
        listptr = listptr->next;
    }
    return 0;
}
Assembly (loop only)
.L20:
cmp DWORD PTR [rax+8], edi
je .done
mov rax, QWORD PTR [rax]
test rax, rax
jne .L20
Just eyeballing the C code, both methods look competitive. The array method is going to have an increment of i, a couple of comparisons, and one memory access to read the value from the array. The linked list version is going to have a couple of (adjacent) memory accesses to read the Node.item and Node.next members, and a couple of comparisons.
The assembly seems to bear that out: the linked list version has 5 instructions and the array version2 has 6. All of the instructions are simple ones that have a throughput of 1 per cycle or more on modern hardware.
If you test it though - with both lists fully resident in L1, you'll find that the array version executes at about 1.5 cycles per iteration, while the linked list version takes about 4! That's because the linked list version is limited by its loop-carried dependency on listptr. The one line listptr = listptr->next boils down to one instruction, but that one instruction will never execute more than once every 4 cycles, because each execution depends on the completion of the prior one (you need to finish reading listptr->next before you can calculate listptr->next->next). Even though modern CPUs can execute something like 2 loads every cycle, these loads take ~4 cycles to complete, so you get a serial bottleneck here.
The array version also has loads, but the address doesn't depend on the prior load:
add rsi, 4
cmp DWORD PTR [rsi-4], edi
It depends only on rsi, which is simply calculated by adding 4 each iteration. An add has a latency of one cycle on modern hardware, so this doesn't create a bottleneck (unless you get below 1 cycle/iteration). So the array loop is able to use the full power of the CPU, executing many instructions in parallel. The linked list version is not.
This isn't unique to "find" - any linked-list operation that needs to iterate over many elements will have this pointer-chasing behavior, which is inherently slow on modern hardware.
1I omitted the epilogue and prologue for each assembly function because they really aren't doing anything interesting. Both versions had no epilogue at all really, and the prologue was very similar for both, peeling off the first iteration and jumping into the middle of the loop. The full code is available for inspection in any case.
2It's worth noting that gcc didn't really do as well as it could have here, since it maintains both rsi as the pointer into the array, and eax as the index i. This means two separate cmp instructions, and two increments. Better would have been to maintain only the pointer rsi in the loop, and to compare against (array + 4*size) as the "not found" condition. That would eliminate one increment. Additionally, you could eliminate one cmp by having rsi run from -4*size up to zero, and indexing into array using [rdi + rsi] where rdi is array + 4*size. This shows that even today, optimizing compilers aren't getting everything right!
The CPU cache usually takes in a page of a certain size, for example (commonly) 4096 bytes or 4 kB, and accesses the needed information from there. Fetching a page consumes a considerable amount of time, say 1000 cycles. If we have a contiguous array of 4096 bytes, we will fetch one 4096-byte page, and probably most of the data will be there. If not, maybe we need to fetch one more page to get the rest of the data.
Example: we have two pages spanning 0-8191, and the array lies between 2048 and 6244; then we fetch page #1 (0-4095) to get the first elements and page #2 (4096-8191) to get the rest. This results in fetching two pages from memory into our cache to get our data.
What happens with a list, though? In a list the data are non-contiguous, meaning the elements are not in adjacent places in memory, so they are probably scattered across various pages. This means the CPU has to fetch many pages from memory into the cache to get the desired data.
Example: node #1 at address 1000, node #2 at address 5000, node #3 at address 18000. If the CPU works in 4k page sizes, it has to fetch three different pages from memory to find the data it wants.
Also, the memory system uses prefetch techniques to fetch pages before they are needed, so if the linked list is small, say A -> B -> C, the first pass will be slow because the prefetcher can't predict the next block to fetch. On later passes the prefetcher is warmed up and can start predicting the path of the linked list and fetch the correct blocks on time.
Summarizing: arrays are easily predictable by the hardware and sit in one place, so they are easy to fetch, while linked lists are unpredictable and scattered throughout memory, which makes the life of the prefetcher and the CPU harder.
BeeOnRope's answer is good and highlights the cycle count overheads of traversing a linked list vs iterating through an array, but as he explicitly says that's assuming "both lists fully resident in L1". However, it's far more likely that an array will fit better in L1 than a linked list, and the moment you start thrashing your cache the performance difference becomes huge. RAM can be more than 100x slower than L1, with L2 and L3 (if your CPU has any) being between 3x to 14x slower.
On a 64-bit architecture, each pointer takes 8 bytes, and a doubly linked list needs two of them, or 16 bytes of overhead. If you only want a single 4-byte uint32 per entry, that means you need 5x as much storage for the dlist as you need for an array. Arrays guarantee locality, and although malloc can do OK at locality if you allocate things together in the right order, you often can't. Let's approximate poor locality by saying it takes 2x the space, so a dlist uses 10x as much "locality space" as an array. That's enough to push you from fitting in L1 to overflowing into L3, or even worse, from L2 into RAM.
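For concreteness, a sketch of what that per-entry footprint looks like (note that struct padding typically makes the node even larger than the 20 bytes implied above):
#include <stdint.h>
#include <stdio.h>

struct DNode {
    struct DNode *prev;    /* 8 bytes on a 64-bit target */
    struct DNode *next;    /* 8 bytes */
    uint32_t      val;     /* 4 bytes of actual payload  */
};                         /* typically padded to 24 bytes total */

int main(void)
{
    printf("array element: %zu bytes, dlist node: %zu bytes\n",
           sizeof(uint32_t), sizeof(struct DNode));   /* e.g. 4 vs 24 */
    return 0;
}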

Cost of push vs. mov (stack vs. near memory), and the overhead of function calls

Question:
Is accessing the stack the same speed as accessing memory?
For example, I could choose to do some work within the stack, or I could do work directly with a labelled location in memory.
So, specifically: is push ax the same speed as mov [bx], ax? Likewise is pop ax the same speed as mov ax, [bx]? (assume bx holds a location in near memory.)
Motivation for Question:
It is common in C to discourage trivial functions that take parameters.
I've always thought that is because not only must the parameters get pushed onto the stack and then popped off the stack once the function returns, but also because the function call itself must preserve the CPU's context, which means more stack usage.
But assuming one knows the answer to the headlined question, it should be possible to quantify the overhead that the function uses to set itself up (push / pop / preserve context, etc.) in terms of an equivalent number of direct memory accesses. Hence the headlined question.
(Edit: Clarification: near used above is as opposed to far in the segmented memory model of 16-bit x86 architecture.)
Nowadays your C compiler can outsmart you. It may inline simple functions and if it does that, there will be no function call or return and, perhaps, there will be no additional stack manipulations related to passing and accessing formal function parameters (or an equivalent operation when the function is inlined but the available registers are exhausted) if everything can be done in registers or, better yet, if the result is a constant value and the compiler can see that and take advantage of it.
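For example (a hypothetical trivial function; with optimization enabled, a modern compiler will usually inline it and fold the whole call away):
static inline int clamp(int x, int lo, int hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

int adjust(int v)
{
    /* no call, no parameter passing: typically compiles to a couple of
       compare/cmov instructions, and clamp(200, 0, 100) would simply
       become the constant 100 */
    return clamp(v, 0, 100);
}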
Function calls themselves can be relatively cheap (but not necessarily zero-cost) on modern CPUs, if they're repeated and if there's a separate instruction cache and various predicting mechanisms, helping with efficient code execution.
Other than that, I'd expect the performance implications of the choice "local var vs global var" to depend on the memory usage patterns. If there's a memory cache in the CPU, the stack is likely to be in that cache, unless you allocate and deallocate large arrays or structures on it or have deep function calls or deep recursion, causing cache misses. If the global variable of interest is accessed often or if its neighbors are accessed often, I'd expect that variable to be in the cache most of the time as well. Again, if you're accessing large spans of memory that can't fit into the cache, you'll have cache misses and possibly reduced performance (possibly because there may or may not be a better, cache-friendly way of doing what you want to do).
If the hardware is pretty dumb (no or small caches, no prediction, no instruction reordering, no speculative execution, nothing), clearly you want to reduce the memory pressure and the number of function calls, because each and every one will count.
Yet another factor is instruction length and decoding. Instructions to access an on-stack location (relative to the stack pointer) can be shorter than instructions to access an arbitrary memory location at a given address. Shorter instructions may be decoded and executed faster.
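For a rough illustration of that point (my own example; the code generation described in the comments is typical unoptimized x86-64 output, not a guarantee):
int global_counter;

void bump(void)
{
    int local = 0;

    /* Accesses to 'local' are addressed relative to the frame/stack pointer
       with a small 8-bit displacement, e.g. an operand like [rbp-4].        */
    local++;

    /* Accesses to the global need a full 32-bit (absolute or RIP-relative)
       displacement, e.g. an operand like global_counter[rip], so each such
       instruction is encoded in a few more bytes.                           */
    global_counter++;
}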
I'd say there's no definitive answer for all cases because performance depends on:
your hardware
your compiler
your program and its memory accessing patterns
For the clock-cycle-curious...
For those who would like to see specific clock cycles, instruction / latency tables for a variety of modern x86 and x86-64 CPUs are available here (thanks to hirschhornsalz for pointing these out).
You then get, on a Pentium 4 chip:
push ax and mov [bx], ax are virtually identical in their efficiency, with identical latencies and throughputs.
pop ax and mov ax, [bx] are similarly efficient, with identical throughputs, despite mov ax, [bx] having twice the latency of pop ax.
As far as the follow-on question in the comments (3rd comment):
indirect addressing (i.e. mov [bx], ax) is not materially different than direct addressing (i.e. mov [loc], ax), where loc is a variable holding an immediate value, e.g. loc equ 0xfffd.
Conclusion: Combine this with Alexey's thorough answer, and there's a pretty solid case for the efficiency of using the stack and letting the compiler decide when a function should be inlined.
(Side note: In fact, even as far back as the 8086 from 1978, using the stack was still not less efficient than corresponding mov's to memory as can be seen from these old 8086 instruction timing tables.)
Understanding Latency & Throughput
A bit more may be needed to understand timing tables for modern CPUs. These should help:
definitions of latency and throughput
a useful analogy for latency and throughput, and their relation to instruction processing pipelines

C pointers vs direct member access for structs

Say I have a struct like the following ...
typedef struct {
    int WheelCount;
    double MaxSpeed;
} Vehicle;
... and I have a global variable of this type (I'm well aware of the pitfalls of globals; this is for an embedded system, which I didn't design, and for which they're an unfortunate but necessary evil). Is it faster to access the members of the struct directly or through a pointer? i.e.
double LocalSpeed = MyGlobal.MaxSpeed;
or
double LocalSpeed = pMyGlobal->MaxSpeed;
One of my tasks is to simplify and fix a recently inherited embedded system.
In general, I'd say go with the first option:
double LocalSpeed = MyGlobal.MaxSpeed;
This has one less dereference (you're not fetching the pointer and then dereferencing it to get to its location). It's also simpler and easier to read and maintain, since you don't need to create the pointer variable in addition to the struct.
That being said, I don't think any performance difference you'd see would be noticeable, even on an embedded system. Both will be very, very fast access times.
The first one should be faster since it doesn't require pointer dereferencing. Then again, that's true for x86-based systems; I'm not sure about others.
on x86 the first one would translate to something like this
mov eax, [address of MyGlobal.MaxSpeed]
and the second one would be something like this
mov ebx, [address of pMyGlobal]
mov eax, [ebx+sizeof(int)]
On your embedded platform, it's likely that the architecture is optimized in such a way that it's essentially a wash, and even if it wasn't you would only ever notice a performance impact if this was executed in a very tight loop.
There are probably much more obvious performance areas of your system.
struct dataStruct
{
    double first;
    double second;
} data;

int main()
{
    struct dataStruct *pData = &data;

    data.first = 9.0;
    pData->second = 10.0;
}
This is the assembly output using VS2008 release mode:
data.first = 9.0;
008D1000 fld qword ptr [__real#4022000000000000 (8D20F0h)]
pData->second = 10.0;
008D1006 xor eax,eax
008D1008 fstp qword ptr [data (8D3378h)]
008D100E fld qword ptr [__real#4024000000000000 (8D20E8h)]
008D1014 fstp qword ptr [data+8 (8D3380h)]
disassemble, disassemble, disassemble...
Depending on the lines of code you are not showing us, it is possible that if your pointer is somewhat static, a good compiler will know that and pre-compute the address for both. If you don't have optimizations on, then this whole discussion is moot. It also depends on the processor you are using; both can be performed with a single instruction, depending on the processor. So I follow the basic optimization steps:
1) disassemble and examine
2) time the execution
As mentioned above, though, the bottom line is that it may be a case of two instructions instead of one, costing a single clock cycle you would likely never see. The quality of your compiler and your optimizer choices are going to make much more dramatic performance differences than trying to tweak one line of code in hopes of improving performance. Switching compilers can give you 10-20% in either direction, sometimes more, as can changing your optimization flags; turning everything on doesn't make the fastest code, and sometimes -O1 performs better than -O3.
Understanding what those two lines of code produce and how to maximize performance from the high-level language comes from compiling for different processors and disassembling using various compilers. And, more importantly, the code around the lines in question plays a big role in how the compiler optimizes that segment.
Using someone else's example on this question:
typedef struct
{
    unsigned int first;
    unsigned int second;
} dataStruct;

dataStruct data;

int main()
{
    dataStruct *pData = &data;

    data.first = 9;
    pData->second = 10;

    return(0);
}
With gcc (not that great a compiler) you get:
mov r2, #10
mov r1, #9
stmia r3, {r1, r2}
So both lines of C code are joined into one store. The problem here is the example used as a test: two separate functions would have been a little better, but it needs a lot more code around it, and the pointer needs to point at some other memory so the optimizer doesn't realize it is a static global address. To test this you need to pass the address in so the compiler (well, gcc) cannot figure out that it is a static address.
Or with no optimizations, same code, same compiler, no difference between pointer and direct.
mov r3, #9
str r3, [r2, #0]
mov r3, #10
str r3, [r2, #4]
This is what you would expect to see; depending on the compiler and processor, there may be no difference. For this processor, even if the test code hid the static address for the pointer from the function, it would still boil down to two instructions. If the value being stored in the structure element were already loaded in a register, then it would be one instruction either way, pointer or direct.
So the answer to your question is not absolute... it depends. Disassemble and test.
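A minimal sketch of that suggestion (my own hypothetical test harness, not code from the answer): passing the struct's address in as a parameter stops the compiler from folding it into a known static address, so the two access styles can be compared fairly in the disassembly.
typedef struct
{
    unsigned int first;
    unsigned int second;
} dataStruct;

dataStruct data;

/* direct access to the global: the compiler can hard-code the address */
void set_direct(void)
{
    data.first = 9;
    data.second = 10;
}

/* pointer access where the address is passed in: the compiler has to keep
   the pointer in a register and index off it */
void set_via_pointer(dataStruct *pData)
{
    pData->first = 9;
    pData->second = 10;
}
Build it with and without optimization, disassemble both functions, and compare, exactly as the answer recommends.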
I suppose that, if this makes a difference at all, that would be architecture-dependent.
In general, accessing the struct directly would be quicker, as it won't require an extra pointer dereference. The pointer dereference means that it has to take the pointer (the thing in the variable), load whatever it points to, then operate on it.
In C, there should be no difference, or at most an insignificant performance hit.
C students are taught:
pMyGlobal->MaxSpeed == (*pMyGlobal).MaxSpeed
You should be able to compare the disassembly of them both to convince yourself that they are essentially the same, even if you aren't an Assembly-code programmer.
If you are looking for a performance optimization, I would look elsewhere. You won't be able to save enough CPU cycles with this kind of micro-optimization.
For stylistic reasons, I prefer the Structure-Dot notation, especially when dealing with singleton-globals. I find it much cleaner to read.

"register" keyword in C?

What does the register keyword do in C language? I have read that it is used for optimizing but is not clearly defined in any standard. Is it still relevant and if so, when would you use it?
It's a hint to the compiler that the variable will be heavily used and that you recommend it be kept in a processor register if possible.
Most modern compilers do that automatically, and are better at picking them than us humans.
I'm surprised that nobody mentioned that you cannot take the address of a register variable, even if the compiler decides to keep the variable in memory rather than in a register.
So by using register you win nothing (the compiler will decide for itself where to put the variable anyway) and you lose the & operator, so there is no reason to use it.
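A minimal sketch of losing the & operator (my own example):
int main(void)
{
    register int counter = 0;
    /* int *p = &counter;  -- rejected by the compiler: the address of a
       register-qualified variable may not be taken */
    counter++;
    return counter;
}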
It tells the compiler to try to use a CPU register, instead of RAM, to store the variable. Registers are in the CPU and much faster to access than RAM. But it's only a suggestion to the compiler, and it may not follow through.
I know this question is about C, but the same question for C++ was closed as an exact duplicate of this question. This answer therefore may not apply to C.
The latest draft of the C++11 standard, N3485, says this in 7.1.1/3:
A register specifier is a hint to the implementation that the variable so declared will be heavily used. [ note: The hint can be ignored and in most implementations it will be ignored if the address of the variable is taken. This use is deprecated ... —end note ]
In C++ (but not in C), the standard does not state that you can't take the address of a variable declared register; however, because a variable stored in a CPU register throughout its lifetime does not have a memory location associated with it, attempting to take its address would be invalid, and the compiler will ignore the register keyword to allow taking the address.
I have read that it is used for optimizing but is not clearly defined in any standard.
In fact it is clearly defined by the C standard. Quoting the N1570 draft section 6.7.1 paragraph 6 (other versions have the same wording):
A declaration of an identifier for an object with storage-class specifier register suggests that access to the object be as fast as possible. The extent to which such suggestions are effective is implementation-defined.
The unary & operator may not be applied to an object defined with register, and register may not be used in an external declaration.
There are a few other (fairly obscure) rules that are specific to register-qualified objects:
Defining an array object with register has undefined behavior. Correction: it's legal to define an array object with register, but you can't do anything useful with such an object, since indexing into an array requires taking the address of its initial element (see the sketch after this list).
The _Alignas specifier (new in C11) may not be applied to such an object.
If the parameter name passed to the va_start macro is register-qualified, the behavior is undefined.
There may be a few others; download a draft of the standard and search for "register" if you're interested.
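For instance, a minimal sketch of the array rule (my own example):
int main(void)
{
    register int arr[4] = {1, 2, 3, 4};
    /* int x = arr[0];  -- arr[0] means *(arr + 0), and converting a
       register-qualified array to a pointer to its first element is
       undefined behavior, so there is little you can legally do with it */
    return (int)sizeof arr;   /* sizeof does not convert the array, so this is fine */
}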
As the name implies, the original meaning of register was to require an object to be stored in a CPU register. But with improvements in optimizing compilers, this has become less useful. Modern versions of the C standard don't refer to CPU registers, because they no longer (need to) assume that there is such a thing (there are architectures that don't use registers). The common wisdom is that applying register to an object declaration is more likely to worsen the generated code, because it interferes with the compiler's own register allocation. There might still be a few cases where it's useful (say, if you really do know how often a variable will be accessed, and your knowledge is better than what a modern optimizing compiler can figure out).
The main tangible effect of register is that it prevents any attempt to take an object's address. This isn't particularly useful as an optimization hint, since it can be applied only to local variables, and an optimizing compiler can see for itself that such an object's address isn't taken.
It hasn't been relevant for at least 15 years as optimizers make better decisions about this than you can. Even when it was relevant, it made a lot more sense on a CPU architecture with a lot of registers, like SPARC or M68000 than it did on Intel with its paucity of registers, most of which are reserved by the compiler for its own purposes.
Actually, register tells the compiler that the variable does not alias with anything else in the program (not even chars).
That can be exploited by modern compilers in a variety of situations, and can help the compiler quite a bit in complex code - in simple code the compilers can figure this out on their own.
Otherwise, it serves no purpose and is not used for register allocation. It does not usually incur performance degradation to specify it, as long as your compiler is modern enough.
Storytime!
C, as a language, is an abstraction of a computer. It allows you to do things, in terms of what a computer does, that is manipulate memory, do math, print things, etc.
But C is only an abstraction. And ultimately, what it's abstracting away from you is assembly language. Assembly is the language that a CPU reads, and if you use it, you do things in terms of the CPU. What does a CPU do? Basically, it reads from memory, does math, and writes to memory. The CPU doesn't just do math on numbers sitting in memory. First, you have to move a number from memory into a small piece of storage inside the CPU called a register. Once you're done doing whatever you need to do to this number, you can move it back to normal system memory. Why use system memory at all? Registers are limited in number. You only get about a hundred bytes' worth in modern processors, and older popular processors were even more fantastically limited (the 6502 had 3 8-bit registers for your free use). So, your average math operation looks like:
load first number from memory
load second number from memory
add the two
store answer into memory
A lot of that is... not math. Those load and store operations can take up to half your processing time. C, being an abstraction of computers, freed the programmer from the worry of using and juggling registers, and since the number and type vary between computers, C places the responsibility of register allocation solely on the compiler. With one exception.
When you declare a variable register, you are telling the compiler "Yo, I intend for this variable to be used a lot and/or be short lived. If I were you, I'd try to keep it in a register." When the C standard says compilers don't have to actually do anything, that's because the C standard doesn't know what computer you're compiling for, and it might be like the 6502 above, where all 3 registers are needed just to operate, and there's no spare register to keep your number. However, when it says you can't take the address, that's because registers don't have addresses. They're the processor's hands. Since the compiler doesn't have to give you an address, and since it can't have an address at all ever, several optimizations are now open to the compiler. It could, say, keep the number in a register always. It doesn't have to worry about where it's stored in computer memory (beyond needing to get it back again). It could even pun it into another variable, give it to another processor, give it a changing location, etc.
tl;dr: Short-lived variables that do lots of math. Don't declare too many at once.
You are messing with the compiler's sophisticated graph-coloring algorithm. This is used for register allocation. Well, mostly. It acts as a hint to the compiler -- that's true. But it is not ignored in its entirety, since you are not allowed to take the address of a register variable (remember, the compiler, now at your mercy, is free to act differently). Which in a way is telling you not to use it.
The keyword was used long, long ago, when there were so few registers that you could count them all on your fingers.
But, as I said, deprecated doesn't mean you cannot use it.
Just a little demo (without any real-world purpose) for comparison: when the register keywords before each variable are removed, this piece of code takes 3.41 seconds on my i7 (GCC); with register, the same code completes in 0.7 seconds.
#include <stdio.h>

int main(int argc, char** argv) {
    register int numIterations = 20000;
    register int i = 0;
    unsigned long val = 0;

    for (; i < numIterations + 1; i++)
    {
        register int j = 0;
        for (; j < i; j++)
        {
            val = j + i;
        }
    }
    printf("%lu", val);
    return 0;
}
I have tested the register keyword under QNX 6.5.0 using the following code:
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <sys/neutrino.h>
#include <sys/syspage.h>

int main(int argc, char *argv[]) {
    uint64_t cps, cycle1, cycle2, ncycles;
    double sec;
    register int a = 0, b = 1, c = 3, i;

    cycle1 = ClockCycles();
    for (i = 0; i < 100000000; i++)
        a = ((a + b + c) * c) / 2;
    cycle2 = ClockCycles();

    ncycles = cycle2 - cycle1;
    printf("%" PRIu64 " cycles elapsed\n", ncycles);

    cps = SYSPAGE_ENTRY(qtime)->cycles_per_sec;
    printf("This system has %" PRIu64 " cycles per second\n", cps);

    sec = (double)ncycles / cps;
    printf("The cycles in seconds is %f\n", sec);

    return EXIT_SUCCESS;
}
I got the following results:
-> 807679611 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.244600
And now without register int:
int a=0, b = 1, c = 3, i;
I got:
-> 1421694077 cycles elapsed
-> This system has 3300830000 cycles per second
-> The cycles in seconds is ~0.430700
During the seventies, at the very beginning of the C language, the register keyword was introduced in order to allow the programmer to give a hint to the compiler, telling it that the variable would be used very often and that it would be wise to keep its value in one of the processor's internal registers.
Nowadays, optimizers are much more efficient than programmers at determining which variables are more likely to be kept in registers, and the optimizer does not always take the programmer's hint into account.
So many people wrongly recommend not to use the register keyword.
Let’s see why!
The register keyword has an associated side effect: you cannot take a reference to (get the address of) a register-qualified variable.
People advising others not to use register wrongly take this as an additional argument against it.
However, the simple fact that you cannot take the address of a register variable allows the compiler (and its optimizer) to know that the value of this variable cannot be modified indirectly through a pointer.
When, at a certain point in the instruction stream, a register variable has its value held in a processor register, and that register has not since been used to hold the value of another variable, the compiler knows that it does not need to re-load the value of the variable into that register.
This avoids expensive, useless memory accesses.
Do your own tests and you will get significant performance improvements in your innermost loops.
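As a hedged sketch of the claimed effect (my own example; for a simple local like this a modern compiler can usually prove the same thing even without the keyword):
/* Copy bytes while accumulating a checksum. Because 'sum' is declared
   register, its address can never be taken anywhere, so the compiler knows
   the stores through 'out' cannot modify it and may keep it in a register
   for the whole loop instead of re-loading it after every store.          */
unsigned checksum_copy(unsigned char *out, const unsigned char *in, int n)
{
    register unsigned sum = 0;
    for (int i = 0; i < n; i++) {
        out[i] = in[i];
        sum += in[i];
    }
    return sum;
}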
Register would notify the compiler that the coder believed this variable would be written/read enough to justify its storage in one of the few registers available for variable use. Reading/writing from registers is usually faster and can require a smaller op-code set.
Nowadays, this isn't very useful, as most compilers' optimizers are better than you at determining whether a register should be used for that variable, and for how long.
gcc 9.3 asm output, without using optimisation flags (everything in this answer refers to standard compilation without optimisation flags):
#include <stdio.h>

int main(void) {
    int i = 3;
    i++;
    printf("%d", i);
    return 0;
}
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
add DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
#include <stdio.h>

int main(void) {
    register int i = 3;
    i++;
    printf("%d", i);
    return 0;
}
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
push rbx
sub rsp, 8
mov ebx, 3
add ebx, 1
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
add rsp, 8
pop rbx
pop rbp
ret
This forces ebx to be used for the calculation, meaning it needs to be pushed to the stack and restored at the end of the function because it is callee saved. register produces more lines of code and 1 memory write and 1 memory read (although realistically, this could have been optimised to 0 R/Ws if the calculation had been done in esi, which is what happens using C++'s const register). Not using register causes 2 writes and 1 read (although store to load forwarding will occur on the read). This is because the value has to be present and updated directly on the stack so the correct value can be read by address (pointer). register doesn't have this requirement and cannot be pointed to. const and register are basically the opposite of volatile and using volatile will override the const optimisations at file and block scope and the register optimisations at block-scope. const register and register will produce identical outputs because const does nothing on C at block-scope, so only the register optimisations apply.
On clang, register is ignored but const optimisations still occur.
On supported C compilers it tries to optimize the code so that the variable's value is held in an actual processor register.
Microsoft's Visual C++ compiler ignores the register keyword when global register-allocation optimization (the /Oe compiler flag) is enabled.
See register Keyword on MSDN.
The register keyword tells the compiler to store the particular variable in a CPU register so that it can be accessed quickly. From the programmer's point of view, the register keyword is used for variables that are heavily used in a program, so that the compiler can speed up the code. Although it depends on the compiler whether to keep the variable in a CPU register or in main memory.
register indicates to the compiler that it should optimize this code by storing that particular variable in a register rather than in memory. It is a request to the compiler; the compiler may or may not honor it.
You can use this facility in cases where some of your variables are accessed very frequently, for example in a loop.
One more thing: if you declare a variable as register, then you can't get its address, as it is not stored in memory; it gets its allocation in a CPU register.
The register keyword is a request to the compiler that the specified variable is to be stored in a register of the processor instead of memory as a way to gain speed, mostly because it will be heavily used. The compiler may ignore the request.

Resources