Say I have a struct like the following ...
typedef struct {
int WheelCount;
double MaxSpeed;
} Vehicle;
... and I have a global variable of this type (I'm well aware of the pitfalls of globals, this is for an embedded system, which I didn't design, and for which they're an unfortunate but necessary evil.) Is it faster to access the members of the struct directly or through a pointer ? ie
double LocalSpeed = MyGlobal.MaxSpeed;
or
double LocalSpeed = pMyGlobal->MaxSpeed;
One of my tasks is to simplify and fix a recently inherited embedded system.
In general, I'd say go with the first option:
double LocalSpeed = MyGlobal.MaxSpeed;
This has one less dereference (you're not finding the pointer, then dereferencing it to get to it's location). It's also simpler and easier to read and maintain, since you don't need to create the pointer variable in addition to the struct.
That being said, I don't think any performance difference you'd see would be noticable, even on an embedded system. Both will be very, very fast access times.
The first one should be faster since it doesn't require pointer dereferencing. Then again thats true for x86 based systems, not sure for others.
on x86 the first one would translate to something like this
mov eax, [address of MyGlobal.MaxSpeed]
and the second one would be something like this
mov ebx, [address of pMyGlobal]
mov eax, [ebx+sizeof(int)]
On your embedded platform, it's likely that the architecture is optimized in such a way that it's essentially a wash, and even if it wasn't you would only ever notice a performance impact if this was executed in a very tight loop.
There are probably much more obvious performance areas of your system.
struct dataStruct
{
double first;
double second;
} data;
int main()
{
dataStruct* pData = &data;
data.first = 9.0;
pData->second = 10.0;
}
This is the assembly output using VS2008 release mode:
data.first = 9.0;
008D1000 fld qword ptr [__real#4022000000000000 (8D20F0h)]
pData->second = 10.0;
008D1006 xor eax,eax
008D1008 fstp qword ptr [data (8D3378h)]
008D100E fld qword ptr [__real#4024000000000000 (8D20E8h)]
008D1014 fstp qword ptr [data+8 (8D3380h)]
disassemble, disassemble, disassemble...
Depending on the lines of code you are not showing us it is possible that if your pointer is somewhat static a good compiler will know that and pre-compute the address for both. If you dont have optimizations on then this whole discussion is mute. It also depends on the processor you are using, both can be performed with a single instruction depending on the processor. So I follow the basic optimization steps:
1) disassemble and examine
2) time the execution
As mentioned above though the bottom line is it may be a case of two instructions instead of one costing a single clock cycle you would likely never see. The quality of your compiler and optimizer choices are going to make much more dramatic performance differences than trying to tweak one line of code in hopes of improving performance. Switching compilers can give you 10-20% in either direction, sometimes more. As can changing your optimization flags, turning everything on doesnt make the fastest code, sometimes -O1 performs better than -O3.
Understanding what those two lines of code produce and how to maximize performance from the high level language comes from compiling for different processors and disassembling using various compilers. And more importantly the code around the lines in question play a big role in how the compiler optimizes that segment.
Using someone else's example on this question:
typedef struct
{
unsigned int first;
unsigned int second;
} dataStruct;
dataStruct data;
int main()
{
dataStruct *pData = &data;
data.first = 9;
pData->second = 10;
return(0);
}
With gcc (not that great a compiler) you get:
mov r2, #10
mov r1, #9
stmia r3, {r1, r2}
So both lines of C code are joined into one store, the problem here is the example used as a test. Two separate functions would have been a little better but it needs a lot more code around it and the pointer needs to point at some other memory so the optimizer doesnt realize it is a static global address, to test this you need to pass the address in so the compiler (well gcc) cannot figure out that it is a static address.
Or with no optimizations, same code, same compiler, no difference between pointer and direct.
mov r3, #9
str r3, [r2, #0]
mov r3, #10
str r3, [r2, #4]
This is what you would expect to see depending on the compiler and processor, there may be no difference. For this processor even if the test code hid the static address for the pointer from the function it would still boil down to two instructions. If the value being stored in the structure element were already loaded in a register then it would be one instruction either way, pointer or direct.
So the answer to your question is not absolute...it depends. disassemble and test.
I suppose that, if this makes a difference at all, that would be architecture-dependent.
In general, accessing the struct directly would be quicker, as it won't require an extra pointer dereference. The pointer dereference means that it has to take the pointer (the thing in the variable), load whatever it points to, then operate on it.
In C, there should be no difference, or a insignificant performance hit.
C students are taught:
pMyGlobal->MaxSpeed == (*pMyGlobal).MaxSpeed
You should be able to compare the disassembly of them both to convince yourself that they are essentially the same, even if you aren't an Assembly-code programmer.
If you are looking for a performance optimization, I would look elsewhere. You won't be able to save enough CPU cycles with this kind of micro-optimization.
For stylistic reasons, I prefer the Structure-Dot notation, especially when dealing with singleton-globals. I find it much cleaner to read.
Related
Given the following tiny program:
#include <stdlib.h>
#define NEXT goto **ip++
#define guard(n) asm("#" #n)
int main() {
static void *prog[] = {&&next1,&&next2,&&next1,&&next3,&&next1,&&next4,&&next1,&&next5,&&next1,&&loop};
void ** ip=prog;
int count = 100000000;
NEXT;
next1: guard(1); NEXT;
next2: guard(2); NEXT;
next3: guard(3); NEXT;
next4: guard(4); NEXT;
next5: guard(5); NEXT;
loop:
if (count) {
count--;
ip=prog;
NEXT;
}
exit(0);
}
I noticed that each of the next# statements get compiled as TWO instructions.
ldr r2, [r3], #4
mov pc, r2 # indirect register jump
I would have expected this to only need one instruction:
ldr pc, [r3], #4
I found the discussion here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40887
"The problem is that the instruction "ldr pc, [r3, #0]" is not considered a function call by the Cortex-A8's branch predictor, as noted in DDI0344J section 5.2.1, Return stack predictions. Thus, the return from the called function is mispredicted resulting in a penalty of 13 cycles compared to a direct call."
However, the 'goto' is not a function call and there's no expectation for the return stack to be relevant here at all.
I'm wondering if this is some optimization that both GCC and CLANG have missed or is the latter a worse performer for a reason I'm un aware of?
This looks like a missed optimization unless there is some other microarchitectural reason to avoid it on some other CPUs. (That's plausible, but I wouldn't specifically expect it. Loading into a register multiple instructions earlier could give some time to hide load-use latency and reduce possible mispredict penalty, but loading in the previous instruction is unlikely to matter unless there's something special about ldr into PC.)
You're correct, bug #40887 is only about indirect calls with blx vs. manually setting up a return address and jumping. It's not relevant to indirect jumps inside a function, like for a switch or computed goto. (Except perhaps if GCC is avoiding loads into PC in general, so this missed optimization is collateral damage from fixing that bug.)
And you're not using volatile, another thing that often makes GCC do a load with a separate instruction instead of folding it into something else (like avoiding x86 add eax, [rdi]. Or in this case a memory-source jump like ARM load-into-PC, it might consider that special.)
Comparing -mcpu=cortex-a8 -marm vs. -mthumb, we see GCC does need extra instructions in thumb mode to set the low bit of the target address before mov pc,reg. https://godbolt.org/z/87EszvWP1
Or maybe that's a missed optimization, too: just ldr pc, [mem] would stay in the current mode, and we know we're jumping within a single function so there's no possibility of changing mode. And/or the jump table could just have been built with the low bits already set if using bx r2 is actually faster.
https://developer.arm.com/documentation/dui0473/m/arm-and-thumb-instructions/ldr--register-offset- says
For word loads, Rt can be the PC. A load to the PC causes a branch to the address loaded. In ARMv4, bits[1:0] of the address loaded must be 0b00. In ARMv5T and above, bits[1:0] must not be 0b10, and if bit[0] is 1, execution continues in Thumb state, otherwise execution continues in ARM state.
In Thumb mode, ldr into PC is only possible with a 32-bit instruction, but ldr into r0-7 and branching to it can each be 16-bit instructions. But I doubt that would be any better unless you can schedule the load earlier.
When you use the -O0 compiler flag in C, you tell the compiler to avoid any kind of optimization. When you define a variable as volatile, you tell the compiler to avoid optimizing that variable. Can we use the two approaches interchangeably? And if so what are the pros and cons? Below are some pros and cons that I can think of. Are there any more?
Pros:
Using the -O0 flag is helpful if we have a big code base inside which the variables that should have been declared as volatile, are not. If the code is showing buggy behavior, instead of going in the code and finding which variables need to be declared as volatile, we can just use the -O0 flag to eliminate the possibility that optimization is causing the problem.
Cons:
The -O0 flag will affect the entire code while the volatile keyword only affects a specific variable. If we're working on a small microcontroller for example, this could be a problem since using -O0 may produce a big executable.
The short answer is: the volatile keyword does not mean "do not optimize". It is something completely different. It informs the compiler that the variable may be changed by something which is not visible for the compiler in the normal program flow. For example:
It can be changed by the hardware - usually registers mapped in the memory address space
Can be changed by the function which is never called - for example the interrupt routine
Variable can be changed by another process or hardware - for example shared memory in the multiprocessor / multicore systems
The volatile variable has to be read from its storage location every time it is used, and saved every time it was changed.
Here you have an example:
int foo(volatile int z)
{
return z + z + z + z;
}
int foo1(int z)
{
return z + z + z + z;
}
and the resulting code (-O0 optimization option)
foo(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-4]
add edx, eax
mov eax, DWORD PTR [rbp-4]
add edx, eax
mov eax, DWORD PTR [rbp-4]
add eax, edx
pop rbp
ret
foo1(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
sal eax, 2
pop rbp
ret
The difference is obvious I think. The volatile variable is read 4 times, non volatile is read once, then multiplied by 4.
You can play yourself here: https://godbolt.org/g/RiTU4g
In the most cases if the program does not run when you turn on the compiler optimization, you have some hidden UBs in your code. You should debug as long as needed to discover all of them. The correctly written program must run at any optimization level.
Bear in mind that `volatile' does not mean or guarantee the coherency & atomicity.
Compiler flag -O0 is in no way a replacement for proper use of volatile, because the code that does not work when it is properly optimized by the compiler is inherently broken. You do not want a broken code giving you an appearance of "working" until someone forgets to throw the -O0 switch.
It is unusual even for large code bases to have a need for many volatile variables, in terms of the total percentage of variables in the code. Fixing a large code base with missing volatile is likely to require finding a few strategic places where multiple variables need to be volatile, and fixing just these few, rather than taking a "shotgun approach" and disabling all optimizations.
Using the -O0 flag is helpful if we have a big code base inside which the variables that should have been declared as volatile, are not
You could use O0 to debug and fix the problems in such cases.
If the code is showing buggy behavior, instead of going in the code and finding which variables need to be declared as volatile, we can just use the -O0 flag to eliminate the possibility that optimization is causing the problem.
That's a wrong conclusion. There's no guarantee that O0 "fixes" the problem due to some variable(s) missing volatile qualifier. The problem still exists in your code and needs to be fixed.
You seem to have misunderstood volatile. It's not something that controls compiler optimisation per se. Whereas O0 typically disables most optimisations (compiler can still optimize though).
In conclusion, no, they are totally different, serving different purposes. As such, there's no question of using one over other or using interchangeably.
There's no reason to disable compiler optimisations. You need to fix the problem in your code i.e, add volatile qualifiers to variable(s) that require it.
The existing answers already cover volatile pretty well, but I believe the root cause of this question has nothing to do with volatile.
If your code works with -O0 but doesn't with optimizations enabled, you may have a wide variety of bugs in your code, or it is also possible that the compiler is buggy. This being tagged "microcontroller", I wouldn't rule out compiler bugs.
It's possible that you have a buffer overrun or underrun, for example, and the optimizer simply arranges your code in a slightly different way which exposes the bug. Try running your code through a static code analyzer (such as cppcheck or llvm's static code analysis). Whether that's a feasible option depends on how microcontroller-specific your code is, though.
Finally, depending on the compiler, -O0 might still generate code that keeps some value in a register for a while unless volatile is used, so I wouldn't call -O0 a replacement for volatile in any case. (That's compiler specific naturally).
There are 2 ways of allocating global array in C:
statically
char data[65536];
dynamically
char *data;
…
data = (char*)malloc(65536); /* or whatever size */
The question is, which method has better performance? And by how much?
As understand it, the first method should be faster.
Because with the second method, to access the array you have to dereference element's address each time it is accessed, like this:
read the variable data which contains the pointer to the beginning of the array
calculate the offset to specific element
access the element
With the first method, the compiler hard-codes the address of the data variable into the code, first step is skipped, so we have:
calculate the offset to specific element from fixed address defined at compile time
access the element of the array
Each memory access is equivalent to about 40 CPU clock cycles, so , using dynamic allocation, specially for not frequent reads can have significant performance decrease vs static allocation because the data variable may be purged from the cache by some more frequently accessed variable. On the contrary , the cost of dereferencing statically allocated global variable is 0, because its address is already hard-coded in the code.
Is this correct?
One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. Herein, consider char data_s[65536] and char *data_p = malloc(65536)
If the access is sporadic the static/global will be slightly faster:
// slower because we must fetch data_p and then store
void
datasetp(int idx,char val)
{
data_p[idx] = val;
}
// faster because we can store directly
void
datasets(int idx,char val)
{
data_s[idx] = val;
}
Now, if we consider caching, datasetp and datasets will be about the same [after the first access], because the fetch of data_p will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.
However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p at the start of the loop and put it in a register:
void
datasetalls(char val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data_s[idx] = val;
}
void
datasetallp(char val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data_p[idx] = val;
}
void
datasetallp_optimized(char val)
{
int idx;
register char *reg;
// the optimizer will generate the equivalent code to this
reg = data_p;
for (idx = 0; idx < 65536; ++idx)
reg[idx] = val;
}
If the access is so sporadic that data_p gets evicted from the cache, then, the performance difference doesn't matter so much, because access to [either] array is infrequent. Thus, not a target for code tuning.
If such eviction occurs, the actual data array will, most likely, be evicted as well.
A much larger array might have more of an effect (e.g. if instead of 65536, we had 100000000, then mere traversal will evict data_p and by the time we reached the end of the array, the leftmost entries would already be evicted.
But, in that case, the extra fetch of data_p would be 0.000001% overhead.
So, it helps to either benchmark [or model] the particular use case/access pattern.
UPDATE:
Based on some further experimentation [triggered by a comment by Peter], the datasetallp function does not optimize to the equivalent of datasetallp_optimized for certain conditions, due to strict aliasing considerations.
Because the arrays are char [or unsigned char], the compiler generates a data_p fetch on each loop iteration. Note that if the arrays are not char (e.g. int), the optimization does occur and data_p is fetched only once, because char can alias anything but int is more limited.
If we change char *data_p to char *restrict data_p we get the optimized behavior. Adding restrict tells the compiler that data_p will not alias anything [even itself], so it's safe to optimize the fetch.
Personal note: While I understand this, to me, it seems goofy that without restrict, the compiler must assume that data_p can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p pointing to itself or that data_p is part of a struct:
// simplest
char *data_p = malloc(65536);
data_p = (void *) &data_p;
// closer to real world
struct mystruct {
...
char *data_p;
...
};
struct mystruct mystruct;
mystruct.data_p = (void *) &mystruct;
These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should allow fetch optimization].
For myself, I always hand code the equivalent of datasetallp_optimized [sans register], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p only once".
Also, the multifetch problem does not occur when using data_p for input [because we're not modifying anything, aliasing is not a consideration]:
// does fetch of data_p once at loop start
int
datasumallp(void)
{
int idx;
int sum;
sum = 0;
for (idx = 0; idx < 65536; ++idx)
sum += data_p[idx];
return sum;
}
But, while it can be fairly common, "hardwiring" a primitive array manipulation function with an explicit array [either data_s or data_p] is often less useful than passing the array address as an argument.
Side note: clang would optimize some of the functions using data_s into memset calls, so, during experimentation, I modified the example code slightly to prevent this.
void
dataincallx(array_t *data,int val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data[idx] = val + idx;
}
This does not suffer from the multifetch problem. That is, dataincallx(data_s,17) and dataincallx(data_p,37) work about the same [with the initial extra data_p fetch]. This is more likely to be what one might use in general [better code reuse, etc].
So, the distinction between data_s and data_p becomes a bit more of a moot point. Coupled with judicious use of restrict [or using types other than char], the data_p fetch overhead can be minimized to the point where it isn't really that much of a consideration.
It now comes down more to architectural/design choices of choosing a fixed size array or dynamically allocating one. Others have pointed out the tradeoffs.
This is use case dependent.
If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.
However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better that each function references the global [either data_s or data_p] directly.
Calculating offsets is not a big performance issue. You have to consider how you will actually use the array in your code. You'll most likely write something like data[i] = x; and then no matter where data is stored, the program has to load a base address and calculate an offset.
The scenario where the compiler can hard code the address in case of the statically allocated array only happens when you write something like data[55] = x; which is probably a far less likely use case.
At any rate we are talking about a few CPU ticks here and there. It's not something you should go chasing by attempting manual optimization.
Each memory access is equivalent to about 40 CPU clock cycles
What!? What CPU is that? Some pre-ancient computer from 1960?
Regarding cache memory, those concerns may be more valid. It is possible that statically allocated memory utilizes data cache better, but that's just speculation and you'd have to have a very specific CPU in mind to have that discussion.
There is however a significant performance difference between static and dynamic allocation, and that is the allocation itself. For each call to malloc there is a call to the OS API, which in turn runs search function going through the heap and looking for for a free segment. The library also needs to keep track of the address to that segment internally, so that when you call free() it knows how much memory to release. Also, the more you call malloc/free, the more segmented the heap will become.
I think that data locality is much more of an issue than computing the base address of the array. (I could imagine cases where accessing the pointer contents is extremely fast because it is in a register while the offset to the stack pointer or text segment is a compile time constant; accessing a register may be faster.)
But the real issue will be data locality, which is often a reason to be careful with dynamic memory in performance critical tight loops. If you have more dynamically allocated data which happens to be close to your array, chances are the memory will remain in the cache. If you have data scattered all over the RAM allocated at different times, you may have many cache misses accessing them. In that case it would be better to allocate them statically (or on the stack) next to each other, if possible.
There is a small effect here. It's unlikely to be significant, but it is real. It will often take one extra instruction to resolve the extra level of indirection for a global pointer-to-a-buffer instead of a global array. For most uses, other considerations will be more important (like reuse of the same scratch space, vs giving each function its own scratch buffer). Also: avoiding compile-time size limits!
This effect is only present when you reference the global directly, rather than passing around the address as a function parameter. Inlining / whole-program link-time optimization may see all the way back to where the global is used as a function arg initially, and be able to take advantage of it throughout more code, though.
Let's compare simple test functions, compiled by clang 3.7 for x86-64 (SystemV ABI, so the first arg is in rdi). Results on other architectures will be essentially the same:
int data_s[65536];
int *data_p;
void store_s(int val) { data_s[val] = val; }
movsxd rax, edi ; sign-extend
mov dword ptr [4*rax + data_s], eax
ret
void store_p(int val) { data_p[val] = val; }
movsxd rax, edi
mov rcx, qword ptr [rip + data_p] ; the extra level of indirection
mov dword ptr [rcx + 4*rax], eax
ret
Ok, so there's overhead of one extra load. (mov r64, [rel data_p]). Depending on what other static/global objects data_p is stored near, it may tend to stay hot in cache even if we're not using it often. If it's in a cache line with no other frequently-accessed data, it's wasting most of that cache line, though.
The overhead is only paid once per function call, though, even if there's a loop. (Unless the array is an array of pointers, since C aliasing rules require the compiler to assume that any pointer might be pointing to data_p, unless it can prove otherwise. This is the main performance danger when using global pointers-to-pointers.)
If you don't use restrict, the compiler still has to assume that the buffers pointed to by int *data_p1 and int *data_p2 could overlap, though, which interferes with autovectorization, loop unrolling, and many other optimizations. Static buffers can't overlap with other static buffers, but restrict is still needed when using a static buffer and a pointer in the same loop.
Anyway, let's have a look at the code for very simple memset-style loops:
void loop_s(int val) { for (int i=0; i<65536; ++i) data_s[i] = val; }
mov rax, -262144 ; loop counter, counting up towards zero
.LBB3_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rax + data_s+262144], edi
add rax, 4
jne .LBB3_1
ret
Note that clang uses a non-RIP-relative effective address for data_s here, because it can.
void loop_p(int val) { for (int i=0; i<65536; ++i) data_p[i] = val; }
mov rax, qword ptr [rip + data_p]
xor ecx, ecx
.LBB4_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rax + 4*rcx], edi
add rcx, 1
cmp rcx, 65536
jne .LBB4_1
ret
Note the mov rax, qword [rip + data_p] in loop_p, and the less efficient loop structure because it uses the loop counter as an array index.
gcc 5.3 has much less difference between the two loops: it gets the start address into a register and increments it, with a compare against the end address. So it uses a one-register addressing mode for the store, which may perform better on Intel CPUs. The only difference in loop structure / overhead for gcc is that the static buffer version gets the initial pointer into a register with a mov r32, imm32, rather than a load from memory. (So the address is an immediate constant embedded in the instruction stream.)
In shared-library code, and on OS X, where all executables must be position-independent, gcc's way is the only option. Instead of mov r32, imm32 to get the address into a register, it would use lea r64, [RIP + displacement]. The opportunity to save an instruction by embedding the absolute address into other instructions is gone when you need to offset the address by a variable amount (e.g. array index). If the array index is a compile-time constant, it can be folded into the displacement for a RIP-relative load or store instead of LEA. For a loop over an array, this could only happen with full unrolling, and is thus unlikely.
Still, the extra level of indirection is still there with a pointer to dynamically allocated memory. There's still a chance of a cache or TLB miss when doing a load instead of an LEA.
Note that global data (as opposed to static) has an extra level of indirection through the global offset table, but that's on top of the indirection or lack thereof from dynamic allocation. compiling with gcc 5.3 -fPIC.
int global_data_s[65536];
int access_global_s(int i){return global_data_s[i];}
mov rax, QWORD PTR global_data_s#GOTPCREL[rip] ; load from a RIP-relative address, instead of an LEA
movsx rdi, edi
mov eax, DWORD PTR [rax+rdi*4] ; load, indexing the array
ret
int *global_data_p;
int access_global_p(int i){return global_data_p[i];}
mov rax, QWORD PTR global_data_p#GOTPCREL[rip] ; extra layer of indirection through the GOT
movsx rdi, edi
mov rax, QWORD PTR [rax] ; load the pointer (the usual layer of indirection)
mov eax, DWORD PTR [rax+rdi*4] ; load, indexing the array
ret
If I understand this correctly, the compiler doesn't assume that the symbol definition for global symbols in the current compilation unit are the definitions that will actually be used at link time. So the RIP-relative offset isn't a compile-time constant. Thanks to runtime dynamic linking, it's not a link-time constant either, so an extra level of indirection through the GOT is used. This is unfortunate, and I hope compilers on OS X don't have this much overhead for global variables. With -O0 -fwhole-program on godbolt, I can see that even the globals are accessed with just RIP-relative addressing, not through the GOT, so perhaps that option is even more valuable than usual when making position-independent executables. Hopefully link-time-optimization works too, because that could be used when making shared libraries.
As many other answers have pointed out, there are other important factors, like memory locality, and the overhead of actually doing the allocate/free. These don't matter much for a large buffer (multiple pages) that's allocated once at program startup. See the comments on Peter A. Schneider's answer.
Dynamic allocation can give significant benefits, though, if you end up using the same memory as scratch space for multiple different things, so it stays hot in cache. Giving each function its own static buffer for scratch space is often a bad move if they aren't needed simultaneously: the dirty memory has to get written back to main memory when it's no longer needed, and is part of the program's footprint forever.
A good way to get small scratch buffers without the overhead of malloc (or new) is to create them on the stack (e.g. as local array variables). C99 allows variable-sized local arrays, like foo(int n) { int buf[n]; ...; } Be careful not to overdo it and run out of stack space, but the current stack page is going to be hot in the TLB. The _local functions in my godbolt links allocate a variable-sized array on the stack, which has some overhead for re-aligning the stack to a 16B boundary after adding a variable size. It looks like clang takes care to mask off the sign bit, but gcc's output looks like it will just break in fun and exciting ways if n is negative. (In godbolt, use the "binary" button to get disassembler output, instead of the compiler's asm output, because the disassembly uses hex for immediate constants. e.g. clang's movabs rcx, 34359738352 is 0x7fffffff0). Even though it takes a few instructions, it's much cheaper than malloc. A medium to large allocation with malloc, like 64kiB, will typically make an mmap system call. But this is the cost of allocation, not the cost of accessing once allocated.
Having the buffer on the stack means the stack pointer itself is the base address for indexing into it. This means it doesn't take an extra register to hold that pointer, and it doesn't have to be loaded from anywhere.
If any elements are statically initialized to non-zero in a static (or global), the entire array or struct will be sitting there in the executable, which is a big waste of space if most entries should be zero at program startup. (Or if the data can be computed on the fly quickly.)
On some systems, having a gigantic array zero-initialized static array doesn't cost you anything as long as you never even read the parts you don't need. Lazy mapping of memory means the OS maps all the pages of your giant array to the same page of zeroed memory, and does copy-on-write. Taking advantage of this would be an ugly performance hack to be used only if you were sure you really wanted it, and were sure your code would never run on a system where your char data[1<<30] would actually use that much memory right away.
Each memory access is equivalent to about 40 CPU clock cycles.
This is nonsense. The latency can be anywhere from 3 or 4 cycles (L1 cache hit) to multiple hundreds of cycles (main memory), or even a page fault requiring a disk access. Other than a page fault, much of this latency can overlap with other work, so the impact on throughput can be much lower. A load from a constant address can start as soon as the instruction issues into the out-of-order core, since it's the start of a new dependency chain. The size of the out-of-order window is limited (an Intel Skylake core has a Re-Order Buffer of 224 uops, and can have 72 loads in flight at once). A full cache miss (or worse, a TLB miss followed by a cache miss) often does stall out-of-order execution. See http://agner.org/optimize/, and other links in the x86 wiki. Also see this blog post about the impact of ROB size on how many cache misses can be in flight at once.
Out-of-order cores for other architectures (like ARM and PPC) are similar, but in-order cores suffer more from cache misses because they can't do anything else while waiting. (Big x86 cores like Intel and AMD's mainstream microarchitectures (not the low-power Silvermont or Jaguar microarchitectures) have more out-of-order execution resources than other designs, though. AFAIK, most ARM cores have significantly smaller buffers for starting independent loads early and/or hiding cache-miss latency.)
I would say you really should profile it. Theoretically you are right but there are some basic things you have to remember.
Language C is a high-level language like many there exist today and you tell the machine what to do. Getting closer to machine code would be considering ASM or similar. If you build code, through compiling and linking or whatever, the compiler will try the best to correctly run what you demand and optimize it (unless you don't want that). Remember, there also exist concepts like Just-In-Time compilation (JIT).
So I consider it hard to answer your question. For one thing you can be sure. A static array will most likely be faster especially with the size of 65536 because there are more chances of optimization for the compiler. This might depend on what size you defined. For GCC 65536 bytes seems to be common for stacks and caches, not sure. Some compilers might even tell you the array is too big, because they try to keep it in other memory hierarchies like caches which also are faster than Random Access Memory.
Last but not least remember that modern operating systems also have their memory management using virtual memory.
Static memory can be stored in data segments and will most likely be loaded when the program is executed, but remember this is also time you have to consider. Allocate the memory by the OS when the program is started or do it at runtime? It really depends on your application.
So I think you really should benchmark your results and see by how much faster it is. But as tendency I would say your static array will compile a code that is going to run faster.
Does using lots of global variables in C code decrease or increase performance, when compiled for an ARM7 embedded platform?
The code base consists of multiple C source code files which refer each other's global variables using the extern keyword. Different functions from different source code files refer to different global variables. Some of the variables are arrays.
The compiler I'm using is IAR's EW ARM kickstart edition (32kb).
This will always decrease performance and increase program size versus static variables. Your question doesn't specifically ask what you are comparing to. I can see various alternatives,
Versus static variables.
Versus parameters passed by value.
Versus values in a passed array or structure pointers.
The ARM blog gives specifics on how to load a constant to an arm register. This step must always be done to get the address of a global variable. The compiler will not know a prior how far away a global is. If you use gcc with -lto or use something like whole-program, then better optimizations can be performed. Basically, these will transform the global to a static.
Here a compiler may keep a register with the address of a global base and then different variables are loaded with an offset; such as ldr rN, [rX, #offset]. That is, if you are lucky.
The design of RISC CPUs, like ARM support a load/store unit which handles all memory accesses. Typically, the load/store instructions are capable of the [register + offset] form. Also, all RISC registers are approximately symmetric. Meaning any register can be used for this offset access. Typically, if you pass a struct or array pointer as a parameter, then it becomes the same thing. Ie, ldr rN, [rX, #offset].
Now, the advantage of the parameter is that eventually, your routines can support multiple arrays or structures by passing different pointers. Also, it gives you the advantage to group common data together which gives cache benefits.
I would argues that globals are detrimental on the ARM. You should only use global pointers, where your code needs a singleton. Or you have some sort of synchronization memory. Ie, globals should only be use for global functionality and not for data.
Passing all of the values via the stack is obviously in-efficient and misses the value of a memory reference or pointer.
Well, using global variables does not impact CPU performance directly. Stack allocation is typically a single add or subtract at function entry/exit respectively.
However, the stack is very limited in size. Using dynamic allocation on the heap is typically the solution. In embedded systems, this may be a problem because of how long it may take to allocate or free dynamic memory.
If allocating and freeing from the heap is a problem for your system, global variables may alleviate the problem of allocation/free execution time.
I wouldn't recommend this as your first solution — especially if this application involves threading. It may be difficult to track down which threads/functions are modifying global variables, leading to future headaches. static variables are technically placed in the same location as global variables ("global and static data"), so you may want to consider this option first.
Any performance benefit or otherwise would depend entirely on the access pattern and usage, so it is not possible to state in an individual case without seeing the code. The code may be efficient or inefficient regardless of the use of globals.
If by making the data global, you avoid function calls to accessors functions for example, and such accesses are frequent, then avoiding the function call overhead may have a measurable performance advantage. But simply being global in and of itself will not have any advantage - its about the method of access and the number of instructions that generates (or wait states if the memory accesses is slower than the processor - off-chip memory for example - but that applies to any data, global or otherwise).
The use of globals in the manner you describe is usually indicative of poor design and/or developer inexperience, and there are likely to be areas of the code that have a far greater impact on performance than mere locality of data access.
In the end the use of global data to gain some perceived performance advantage is ill-conceived. Performance in most cases should be about achieving required real-time deadlines or data-throughput, not about being as fast as possible; if your processor ends up idling 90% f the time, all you have achieved is more time to do nothing.
I suspect your code-base uses global data more out of poor design or workmanship more any deliberate performance concerns. Encapsulated static data with explicitly in-lined or compiler-optimised access functions is likely have similar performance while being more maintainable and easier to debug - advantages that probably far outweigh the performance issues. Ask yourself whether it will be better to save a millisecond of CPU time or a month of development time, or worse a product recall and loss of customers because your product fails in the field.
You are probably worrying about something that's not really a problem for you however...
From theoretical or nitpicking point of view, accessing global variables require some kind of redirection (like GOT for PIC), thus they are slower to access.
When you are accessing variables in local scope, you are implicitly using local references like your stack pointer or values laying in registers, so accessing them is faster.
For example:
extern int x;
int foo(int a, int b, int c, int d, int e) {
return x + b + e;
}
compiles to
foo(int, int, int, int, int):
movw r3, #:lower16:x
movt r3, #:upper16:x
ldr r3, [r3, #0]
adds r0, r1, r3
ldr r3, [sp, #0]
adds r0, r0, r3
bx lr
You can see accessing b (r1) or e (ldr r3, [sp, #0]) requires less instructions compared to accessing x (movw r3, #:lower16:x; movt r3, #:upper16:x; ldr r0, [r3, #0]).
When I read this question I remembered someone once telling me (many years ago) that from an assembler-point-of-view, these two operations are very different:
n = 0;
n = n - n;
Is this true, and if it is, why is it so?
EDIT: As pointed out by some replies, I guess this would be fairly easy for a compiler to optimize into the same thing. But what I find interesting is why they would differ if the compiler had a completely general approach.
Writing assembler code you often used:
xor eax, eax
instead of
mov eax, 0
That is because with the first statement you have only the opcode and no involved argument. Your CPU will do that in 1 cylce (instead of 2). I think your case is something similar (although using sub).
Compiler VC++ 6.0, without optimisations:
4: n = 0;
0040102F mov dword ptr [ebp-4],0
5:
6: n = n - n;
00401036 mov eax,dword ptr [ebp-4]
00401039 sub eax,dword ptr [ebp-4]
0040103C mov dword ptr [ebp-4],eax
In the early days, memory and CPU cycles were scarce. That lead to a lot of so called "peep-hole optimizations". Let's look at the code:
move.l #0,d0
moveq.l #0,d0
sub.l a0,a0
The first instruction would need two bytes for the op-code and then four bytes for the value (0). That meant four bytes wasted plus you'd need to access the memory twice (once for the opcode and once for the data). Sloooow.
moveq.l was better since it would merge the data into the op-code but it only allowed to write values between 0 and 7 into a register. And you were limited to data registers only, there was no quick way to clear an address register. You'd have to clear a data register and then load the data register into an address register (two op-codes. Bad.).
Which lead to the last operation which works on any register, need only two bytes, a single memory read. Translated into C, you'd get
n = n - n;
which would work for most often used types of n (integer or pointer).
An optimizing compiler will produce the same assembly code for the two.
It may depend on whether n is declared as volatile or not.
The assembly-language technique of zeroing a register by subtracting it from itself or XORing it with itself is an interesting one, but it doesn't really translate to C.
Any optimising C compiler will use this technique if it makes sense, and trying to write it out explicitly is unlikely to achieve anything.
In C they only differ (for integer types) if your compiler sucks (or you disabled optimization like an MSVC answer shows).
Perhaps the person who told you this way trying to describe an asm instruction like sub reg,reg using C syntax, not talking about how such a statement would actually compile with a modern optimizing compiler? In which case I wouldn't say "very different" for most x86 CPUs; most do special case sub same,same as a zeroing idiom, like xor same,same. What is the best way to set a register to zero in x86 assembly: xor, mov or and?
That makes an asm sub reg,reg similar to mov reg,0, with somewhat better code size. (But yes, some unique benefits wrt. partial-register renaming on Intel P6-family that you can only get from zeroing idioms, not mov).
They could differ in C if your compiler is trying to implement the mostly-deprecated memory_order_consume semantics from <stdatomic.h> on a weakly-ordered ISA like ARM or PowerPC, where n=0 breaks the dependency on the old value but n = n-n; still "carries a dependency", so a load like array[n] will be dependency-ordered after n = atomic_load_explicit(&shared_var, memory_order_consume). See Memory order consume usage in C11 for more details
In practice compilers gave up on trying to get that dependency-tracking right and promote consume loads to acquire. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0371r1.html and When should you not use [[carries_dependency]]?
But in asm for weakly-ordered ISAs, sub dst, same, same is required to stil carry a dependency on the input register, just like in C. (Most weakly-ordered ISAs are RISCs with fixed-width instructions so avoiding an immediate operand doesn't make the machine code any smaller. Thus there is no historical use of shorter zeroing idioms like sub r1, r1, r1 even on ISAs like ARM that don't have an architectural zero register. mov r1, #0 is the same size and at least as efficient as any other way. On MIPS you'd just move $v0, $zero)
So yes, for those non-x86 ISAs, they are very different in asm. n=0 avoids any false dependency on the old value of the variable (register), while n=n-n can't execute until the old value of n is ready.
Only x86 special-cases sub same,same and xor same,same as a dependency-breaking zeroing idiom like mov eax, imm32, because mov eax, 0 is 5 bytes but xor eax,eax is only 2. So there was a long history of using this peephole optimization before out-of-order execution CPUs, and such CPUs needed to run existing code efficiently. What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains the details.
Unless you're writing by hand in x86 asm, write 0 like a normal person instead of n-n or n^n, and let the compiler use xor-zeroing as a peephole optimization.
Asm for other ISAs might have other peepholes, e.g. another answer mentions m68k. But again, if you're writing in C this is the compiler's job. Write 0 when you mean 0. Trying to "hand hold" the compiler into using an asm peephole is very unlikely to work with optimization disabled, and with optimization enabled the compiler will efficiently zero a register if it needs to.
not sure about assembly and such, but generally,
n=0
n=n-n
isnt always equal if n is floating point, see here
http://www.codinghorror.com/blog/archives/001266.html
Here are some corner cases where the behavior is different for n = 0 and n = n - n:
if n has a floating point type, the result will differ from 0 for specific values: -0.0, Infinity, -Infinity, NaN...
if n is defined as volatile: the first expression will generate a single store into the corresponding memory location, while the second expression will generate two loads and a store, furthermore if n is the location of a hardware register, the 2 loads might yield different values, causing the write to store a non 0 value.
if optimisations are disabled, the compiler might generate different code for these 2 expressions even for plain int n, which might or might not execute at the speed.