I know that 'static' is about scope, but I've got a question: which will be faster to access, a 'static' function/variable or a non-static one?
Which code will be faster:
#include <stdio.h>

int main() {
    int count;
    for (count = 0; count < 1000; ++count)
        printf("%d\n", count);
    return 0;
}
or
#include <stdio.h>

int main() {
    static int count;
    for (count = 0; count < 1000; ++count)
        printf("%d\n", count);
    return 0;
}
In my code I'm working with VERY big numbers (using unsigned long long), and I'm accessing and incrementing them about 4,000,000 times a second. The code above is not the one I'm working on; it's just an example.
As a sign of good will, I have made up a program that we can actually reason about.
#include <stdint.h>
#include <stdio.h>

int
main()
{
    static const uint64_t a = 1664525UL;
    static const uint64_t c = 1013904223UL;
    static const uint64_t m = (1UL << 31);
    static uint32_t x = 1;
    register unsigned i;
    for (i = 0; i < 1000000000U; ++i)
        x = (a * x + c) % m;
    printf("%d\n", x);
    return 0;
}
It simply computes the one billionth element of a pseudo-random sequence produced by a simple linear congruential generator. We have to do something more involved than merely incrementing a counter, or the compiler will optimize the entire loop out of existence.
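For reference, the update the loop performs is the usual LCG recurrence, with the constants taken straight from the code above:

$$x_{n+1} = (a \cdot x_n + c) \bmod m, \qquad a = 1664525,\quad c = 1013904223,\quad m = 2^{31}.$$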
Here is how I have compiled (GCC 4.9.1 on x86_64 GNU/Linux):
$ gcc -o non-static -Dstatic= -Wall -O3 main.c
$ gcc -o static -Wall -O3 main.c
To get the version without static, we simply #define it away on the compiler command line.
Running both programs took 2.36 seconds, meaning there is no measurable performance difference.
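For anyone reproducing this, the runtimes can be measured with the shell's time builtin:
$ time ./static
$ time ./non-static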
To find out why, I like to look at the assembly code.
$ gcc -S -o non-static.s -Dstatic= -Wall -O3 main.c
$ gcc -S -o static.s -Wall -O3 main.c
We find that GCC generated identical machine code for the inner loop and moved the special treatment for the static variables out of the loop, which is what we should have expected from a good compiler.
Relevant code with static:
main:
.LFB11:
.cfi_startproc
movl x.2266(%rip), %esi
movl $1000000000, %eax
.p2align 4,,10
.p2align 3
.L2: # BEGIN LOOP
imull $1664525, %esi, %esi
addl $1013904223, %esi
andl $2147483647, %esi
subl $1, %eax
jne .L2 # END LOOP
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi
xorl %eax, %eax
movl %esi, x.2266(%rip)
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
and without:
main:
.LFB11:
.cfi_startproc
movl $1000000000, %eax
movl $1, %esi
.p2align 4,,10
.p2align 3
.L2: # BEGIN LOOP
imull $1664525, %esi, %esi
addl $1013904223, %esi
andl $2147483647, %esi
subl $1, %eax
jne .L2 # END LOOP
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
This just re-emphasizes what many have tried to express in their comments: We need actual code to reason about performance and we should really benchmark it while doing so.
Also, you shouldn't worry too much about such things and trust your compiler most of the time. Focus on writing readable and maintainable code and only fiddle with the dirty details if you have evidence that it is necessary to achieve the required performance. In your particular example, I cannot see any valid reason to declare the local variables static. It disturbs me as a reader and should not be done.
This question already has answers here:
Can gcc or clang inline functions that are not in the same compilation unit?
(1 answer)
How do I force gcc to inline a function?
(8 answers)
C, inline function and GCC [duplicate]
(4 answers)
In C, should inline functions in headers be externed in the .c file?
(2 answers)
Closed 7 days ago.
There may be a very simple solution to this problem, but it has been bothering me for a while, so I have to ask.
In our embedded projects, it is common to have simple get/set functions for many variables in separate C files. Those accessor functions are then called from many other C files. When I look at the assembly listing, those function calls are never replaced with move instructions. A faster way would be to simply declare the monitored variables as globals and avoid the unnecessary function calls.
Let's say you have a file.c with variables that need to be monitored from another C file, main.c: for example, debugging variables, hardware registers, ADC values, and so on. Is there a compiler optimization that replaces simple get/set functions with move instructions, thus avoiding the overhead of the function calls?
file.h
#ifndef FILE_H
#define FILE_H
#include <stdint.h>
int32_t get_signal(void);
void set_signal(int32_t x);
#endif
file.c
#include "file.h"
#include <stdint.h>
static volatile int32_t *signal = SOME_HARDWARE_ADDRESS;
int32_t get_signal(void)
{
return *signal;
}
void set_signal(int32_t x)
{
*signal = x;
}
main.c
#include "file.h"
#include <stdio.h>
int main(int argc, char *args[])
{
// Do something with the variable
for (int i = 0; i < 10; i++)
{
printf("signal = %d\n", get_signal());
}
return 0;
}
If I compile the above code with gcc -Wall -save-temps main.c file.c -o main.exe, it gives the following assembly listing for main.c. You always see the call get_signal instruction, even when compiling with the -O3 flag, which seems silly since we are only reading a memory address. Why bother calling such a simple function?
The same applies to the simple set function: it is always called, even though it only writes to a single memory location and does nothing else.
main.s
main:
pushq %rbp
.seh_pushreg %rbp
movq %rsp, %rbp
.seh_setframe %rbp, 0
subq $48, %rsp
.seh_stackalloc 48
.seh_endprologue
movl %ecx, 16(%rbp)
movq %rdx, 24(%rbp)
call __main
movl $0, -4(%rbp)
jmp .L4
.L5:
call get_signal
movl %eax, %edx
leaq .LC0(%rip), %rcx
call printf
addl $1, -4(%rbp)
.L4:
cmpl $9, -4(%rbp)
jle .L5
movl $0, %eax
addq $48, %rsp
popq %rbp
ret
UPDATED 2023-02-13
The question was closed with several links to inline- and link-time-optimization-related answers. I don't think the same question has been answered before, or at least the solution is not obvious for my get function. What is there to inline if a function just returns a value and does nothing else?
Anyway, as suggested, one solution to this problem is to add the compiler flags -O2 -flto, which correctly replaces the call get_signal instruction with a move, giving the following partial output:
main:
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
call __main
movl tmp.0(%rip), %edx
movl $10, %eax
.p2align 4,,10
.p2align 3
.L4:
movl signal(%rip), %ecx
addl %ecx, %edx
subl $1, %eax
jne .L4
leaq .LC0(%rip), %rcx
movl %edx, tmp.0(%rip)
call printf.constprop.0
xorl %eax, %eax
addq $40, %rsp
ret
.seh_endproc
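Another approach that should avoid the call even without LTO (a sketch only; I have not tried it on my toolchain) is to define the accessors as static inline functions directly in the header, so that every including file can see the body and inline the access:

/* file.h -- sketch; assumes the register address can be exposed in the
   header. SOME_HARDWARE_ADDRESS remains the same placeholder as above,
   and SIGNAL_REG is just a name invented for this sketch. */
#ifndef FILE_H
#define FILE_H

#include <stdint.h>

#define SIGNAL_REG ((volatile int32_t *)SOME_HARDWARE_ADDRESS)

static inline int32_t get_signal(void) { return *SIGNAL_REG; }
static inline void set_signal(int32_t x) { *SIGNAL_REG = x; }

#endif

The trade-off is that the hardware address is no longer private to file.c.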
Thank you.
Below is a C function to evaluate a polynomial:
/* Calculate a0 + a1*x + a2*x^2 + ... + an*x^n */
/* from CSAPP Ex.5.5, modified to integer version */
int poly(int a[], int x, int degree) {
    long int i;
    int result = a[0];
    int xpwr = x;
    for (i = 1; i <= degree; ++i) {
        result += a[i]*xpwr;
        xpwr *= x;
    }
    return result;
}
And a main function:
#include <stdio.h>

#define TIMES 100000ll

extern int g_a[10000];  /* global array defined elsewhere */

int main(void) {
    long long int i;
    unsigned long long int result = 0;
    for (i = 0; i < TIMES; ++i) {
        /* g_a is an int[10000] global variable with all elements equal to 1 */
        /* x = 2, i.e. evaluate 1 + 2 + 2^2 + ... + 2^9999 */
        result += poly(g_a, 2, 9999);
    }
    printf("%llu\n", result);
    return 0;
}
When I compile the program with GCC using the options -O1 and -O2 separately, I find that -O1 is a lot FASTER than -O2.
Platform details:
i5-4600
Arch Linux x86_64 with kernel 3.18
GCC 4.9.2
gcc -O1 -o /tmp/a.out test.c
gcc -O2 -o /tmp/a.out test.c
Result:
When TIMES = 100000ll, -O1 prints the result instantly, while -O2 needs 0.36s
When TIMES = 1000000000ll, -O1 prints the result in 0.28s, -O2 takes so long that I didn't finish the test
It seems that -O1 is approximately 10000 times faster than -O2.
When I test it on Mac (clang-600.0.56), the result is even more weird: -O1 takes no more than 0.02s even when TIMES = 1000000000000000000ll
I have tested the following changes:
makes g_a random (elements are from 1 to 10)
x = 19234 (or some other number)
use int instead of long long int
And the results are the same.
I looked at the assembly code: it seems that -O1 calls the poly function, while -O2 inlines it. But inlining should make performance better, shouldn't it?
What makes such a huge difference? Why can -O1 on clang make the program so fast? Is -O1 doing something wrong? (I cannot check the result, as the program is too slow without optimization.)
Here is the assembly code of main for -O1 (you can get it by adding the -S option to gcc):
main:
.LFB12:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $9999, %edx
movl $2, %esi
movl $g_a, %edi
call poly
movslq %eax, %rdx
movl $100000, %eax
.L6:
subq $1, %rax
jne .L6
imulq $100000, %rdx, %rsi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
And for -O2:
main:
.LFB12:
.cfi_startproc
movl g_a(%rip), %r9d
movl $100000, %r8d
xorl %esi, %esi
.p2align 4,,10
.p2align 3
.L8:
movl $g_a+4, %eax
movl %r9d, %ecx
movl $2, %edx
.p2align 4,,10
.p2align 3
.L7:
movl (%rax), %edi
addq $4, %rax
imull %edx, %edi
addl %edx, %edx
addl %edi, %ecx
cmpq $g_a+40000, %rax
jne .L7
movslq %ecx, %rcx
addq %rcx, %rsi
subq $1, %r8
jne .L8
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC1, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
Although I don't know much about assembly, it is obvious that -O1 calls poly just once and multiplies the result by 100000 (imulq $100000, %rdx, %rsi). That is why it is so fast.
It seems that gcc can detect that poly is a pure function with no side effects. (It would be interesting if another thread modified g_a while poly was running...)
On the other hand, -O2 has inlined the poly function, so it never gets the chance to treat poly as a call to a pure function.
I have done some further research:
I cannot find the specific flag used by -O1 that performs the pure-function check.
I have tried all the flags listed by gcc -Q -O1 --help=optimizers individually, but none of them has this effect.
Maybe it needs a combination of flags to produce the effect, but it is very hard to try all the combinations.
However, I have found the flag used by -O2 that makes the effect disappear: -finline-small-functions. The name of the flag explains itself.
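A related experiment (just a sketch; I have not verified that it restores the -O1 behaviour under -O2) would be to assert the purity explicitly with GCC's pure function attribute, which grants the optimizer the same licence that the automatic analysis apparently derives at -O1:

/* Sketch: "pure" tells GCC the function has no side effects and its
   result depends only on its arguments and on global memory (here g_a),
   so repeated calls with the same arguments may be merged or hoisted
   out of loops that do not write memory. */
__attribute__((pure))
int poly(int a[], int x, int degree);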
One thing that jumps out at me is that you're overflowing signed integers, and the behaviour of signed overflow is undefined in C. Specifically, an int result cannot hold pow(2,9999). I don't see the point of benchmarking code with undefined behaviour.
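To make that concrete, here is a tiny sketch (my own, not from the question): with x = 2, xpwr doubles on every iteration, so it leaves the range of a 32-bit int after about 31 steps, and everything the loop computes after that point is undefined.

/* Toy demonstration of where the overflow starts. A long long is used
   here so the demonstration itself stays well defined. */
#include <limits.h>
#include <stdio.h>

int main(void) {
    long long xpwr = 2;   /* 2^1 */
    int i = 1;
    while (xpwr <= INT_MAX) {
        xpwr *= 2;
        ++i;
    }
    printf("2^%d no longer fits in an int\n", i);   /* prints i = 31 */
    return 0;
}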
My C code:
#include <stdio.h>

foo()
{
    char buffer[8];
}

main()
{
    foo();
    return 0;
}
I compile it using gcc -ggdb -mpreferred-stack-boundary=2 -o bar bar.c
When I load it into GDB with gdb ./bar, I see that inside the foo function the code is:
sub $0x0c,%esp
Why is this happening?
I want the buffer to take 8 bytes on the stack, so it should be sub $0x8,%esp!
Why can't I set stack boundary to 4 bytes?
Help!
I can't reproduce exactly what you are seeing, but on my 4.8.2 version of gcc, the option does affect the amount of stack used with this code (make sure "buffer" is used to avoid it being optimised away, and fix the warnings for no return type/argument types):
#include <stdio.h>

void foo(void)
{
    char buffer[8];
    buffer[0] = 'a';
    buffer[1] = '\n';
    buffer[2] = 0;
    printf("my first program! %s\n", buffer);
}

int main()
{
    foo();
    return 0;
}
I compiled with -mpreferred-stack-boundary=2 and with -mpreferred-stack-boundary=4, and the difference between the generated assembly files is notable:
$ diff -u stb-2.s stb-4.s
--- stb-2.s 2014-04-10 09:00:39.546038191 +0100
+++ stb-4.s 2014-04-10 09:00:58.895108979 +0100
@@ -15,11 +15,11 @@
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
- subl $16, %esp
- movb $97, -8(%ebp)
- movb $10, -7(%ebp)
- movb $0, -6(%ebp)
- leal -8(%ebp), %eax
+ subl $40, %esp
+ movb $97, -16(%ebp)
+ movb $10, -15(%ebp)
+ movb $0, -14(%ebp)
+ leal -16(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
.LEHB0:
@@ -67,9 +67,10 @@
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
+ andl $-16, %esp
call _Z3foov
movl $0, %eax
- popl %ebp
+ leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
So, at least in gcc 4.8.2. for x86-32, the option has an effect.
Of course, the default according to the docs is -mpreferred-stack-boundary=2, so maybe that's why you can't see any difference from "without" (although in my experiments it seems to be -mpreferred-stack-boundary=4). [Moment passes] Ah, the default has changed over time: the online 4.4.2 docs say 2, while my info gcc for 4.8.2 says 4, which explains the difference.
As to why your code allocates twelve bytes of stack space: look at how printf is called:
movl $.LC0, (%esp)
call printf
If the compiler can, it will pre-allocate argument space for function calls at the start of the function rather than use push $.LC0, as it would in this case. It's not much of a difference, but it saves at least one cleanup instruction on the other side of the printf call, and it makes it MUCH easier to deal with stack-relative offsets in the produced code, since the compiler doesn't have to keep track of where the current stack pointer is: it stays at a constant place from the end of the prologue all the way to the end of the function. Since the space is ultimately required anyway, there's no point in "saving 4 bytes".
Let's say I have following code:
int f() {
    int foo = 0;
    int bar = 0;
    foo++;
    bar++;
    // many more repeated operations in actual code
    foo++;
    bar++;
    return foo+bar;
}
Abstracting the repeated code into a separate function, we get
static void change_locals(int *foo_p, int *bar_p) {
    *foo_p++;
    *bar_p++;
}

int f() {
    int foo = 0;
    int bar = 0;
    change_locals(&foo, &bar);
    change_locals(&foo, &bar);
    return foo+bar;
}
I'd expect the compiler to inline the change_locals function, and optimize things like *(&foo)++ in the resulting code to foo++.
If I remember correctly, taking address of a local variable usually prevents some optimizations (e.g. it can't be stored in registers), but does this apply when no pointer arithmetic is done on the address and it doesn't escape from the function? With a larger change_locals, would it make a difference if it was declared inline (__inline in MSVC)?
I am particularly interested in behavior of GCC and MSVC compilers.
inline (and all its cousins _inline, __inline, ...) is ignored by gcc. It may inline anything it decides is advantageous, except at lower optimization levels.
The code produced by gcc -O3 for x86 is:
.text
.p2align 4,,15
.globl f
.type f, @function
f:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
popl %ebp
ret
.ident "GCC: (GNU) 4.4.4 20100630 (Red Hat 4.4.4-10)"
It returns zero because *foo_p++ doesn't do what you think: postfix ++ binds tighter than unary *, so it increments the pointer and discards the value it pointed to. Correcting the increments to:
(*foo_p)++;
(*bar_p)++;
results in
.text
.p2align 4,,15
.globl f
.type f, @function
f:
pushl %ebp
movl $4, %eax
movl %esp, %ebp
popl %ebp
ret
So it directly returns 4. Not only did it inline them, but it optimized the calculations away.
VC++ from VS 2005 produces similar code, but it also generated unreachable code for change_locals(). I used the command line
/O2 /FD /EHsc /MD /FA /c /TP
If I remember correctly, taking address of a local variable usually prevents some optimizations (e.g. it can't be stored in registers), but does this apply when no pointer arithmetic is done on the address and it doesn't escape from the function?
The general answer is that if the compiler can ensure that no one else will change a value behind its back, it can safely be placed in a register.
Think of this as though the compiler first performs inlining, then transforms all of those *&foo expressions (which result from the inlining) into simply foo, before deciding whether the variables should be placed in registers or in memory on the stack.
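Roughly speaking (this is only an illustration, not actual compiler output), after inlining and that rewrite, f() behaves as if it had been written like this:

/* Conceptual result of inlining both calls and rewriting *&foo as foo;
   the compiler can now keep foo and bar in registers and fold the whole
   function down to "return 4". */
int f(void) {
    int foo = 0;
    int bar = 0;
    foo++; bar++;   /* first inlined call  */
    foo++; bar++;   /* second inlined call */
    return foo + bar;
}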
With a larger change_locals, would it make a difference if it was declared inline (__inline in MSVC)?
Again, generally speaking, whether or not a compiler decides to inline something is determined by heuristics. If you explicitly specify that you want something to be inlined, the compiler will probably weigh that in its decision process.
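If you want more than a hint, both compilers also have non-standard ways to insist on inlining; here is a sketch using GCC's always_inline attribute and MSVC's __forceinline (the wrapper macro is only for illustration):

/* Sketch: compiler-specific "force inline" spellings. */
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE inline __attribute__((always_inline))
#endif

static FORCE_INLINE void change_locals(int *foo_p, int *bar_p)
{
    (*foo_p)++;
    (*bar_p)++;
}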
I've tested gcc 4.5, MSC and IntelC using this:
#include <stdio.h>

void change_locals(int *foo_p, int *bar_p) {
    (*foo_p)++;
    (*bar_p)++;
}

int main() {
    int foo = printf("");
    int bar = printf("");
    change_locals(&foo, &bar);
    change_locals(&foo, &bar);
    printf("%i\n", foo+bar);
}
And all of them did inline/optimize the foo+bar value, but they also generated the code for change_locals() (without using it). Unfortunately, there's still no guarantee that they'd do the same for every such "local function".
gcc:
__Z13change_localsPiS_:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
incl (%edx)
incl (%eax)
leave
ret
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
pushl %ebx
subl $28, %esp
call ___main
movl $LC0, (%esp)
call _printf
movl %eax, %ebx
movl $LC0, (%esp)
call _printf
leal 4(%ebx,%eax), %eax
movl %eax, 4(%esp)
movl $LC1, (%esp)
call _printf
xorl %eax, %eax
addl $28, %esp
popl %ebx
leave
ret
Is there any substantial optimization when omitting the frame pointer?
If I have understood correctly by reading this page, -fomit-frame-pointer is used when we want to avoid saving, setting up and restoring frame pointers.
Is this done only for each function call, and if so, is it really worth it to avoid a few instructions per function? Isn't that trivial as an optimization?
What are the actual implications of using this option apart from the debugging limitations?
I compiled the following C code with and without this option
int main(void)
{
    int i;
    i = myf(1, 2);
}

int myf(int a, int b)
{
    return a + b;
}
# gcc -S -fomit-frame-pointer code.c -o withoutfp.s
# gcc -S code.c -o withfp.s

diff -u'ing the two files revealed the following assembly differences:
--- withfp.s 2009-12-22 00:03:59.000000000 +0000
+++ withoutfp.s 2009-12-22 00:04:17.000000000 +0000
@@ -7,17 +7,14 @@
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
- pushl %ebp
- movl %esp, %ebp
pushl %ecx
- subl $36, %esp
+ subl $24, %esp
movl $2, 4(%esp)
movl $1, (%esp)
call myf
- movl %eax, -8(%ebp)
- addl $36, %esp
+ movl %eax, 20(%esp)
+ addl $24, %esp
popl %ecx
- popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
@@ -25,11 +22,8 @@
.globl myf
.type myf, @function
myf:
- pushl %ebp
- movl %esp, %ebp
- movl 12(%ebp), %eax
- addl 8(%ebp), %eax
- popl %ebp
+ movl 8(%esp), %eax
+ addl 4(%esp), %eax
ret
.size myf, .-myf
.ident "GCC: (GNU) 4.2.1 20070719
Could someone please shed light on the key points of the above code where -fomit-frame-pointer did actually make the difference?
Edit: objdump's output replaced with gcc -S's
-fomit-frame-pointer allows one extra register to be available for general-purpose use. I would assume this is really only a big deal on 32-bit x86, which is a bit starved for registers.*
One would expect to see EBP no longer saved and adjusted on every function call, and probably some additional use of EBP in normal code, and fewer stack operations on occasions where EBP gets used as a general-purpose register.
Your code is far too simple to see any benefit from this sort of optimization; you're not using enough registers. Also, you haven't turned on the optimizer, which might be necessary to see some of these effects.
* ISA registers, not micro-architecture registers.
The only downside of omitting it is that debugging is much more difficult.
The major upside is that there is one extra general purpose register which can make a big difference on performance. Obviously this extra register is used only when needed (probably in your very simple function it isn't); in some functions it makes more difference than in others.
You can often get more meaningful assembly code from GCC by using the -S argument to output the assembly:
$ gcc code.c -S -o withfp.s
$ gcc code.c -S -o withoutfp.s -fomit-frame-pointer
$ diff -u withfp.s withoutfp.s
The -S output contains no addresses, so we can compare the actual generated instructions directly. For your leaf function, this gives:
myf:
- pushl %ebp
- movl %esp, %ebp
- movl 12(%ebp), %eax
- addl 8(%ebp), %eax
- popl %ebp
+ movl 8(%esp), %eax
+ addl 4(%esp), %eax
ret
GCC doesn't generate the code to push the frame pointer onto the stack, and this changes the relative addresses of the arguments passed to the function: with no saved %ebp on the stack, the arguments are found at 4(%esp) and 8(%esp) instead of 8(%ebp) and 12(%ebp).
Profile your program to see if there is a significant difference.
Next, profile your development process. Is debugging easier or more difficult? Do you spend more time developing or less?
Optimizations without profiling are a waste of time and money.