This question already has answers here:
Can gcc or clang inline functions that are not in the same compilation unit?
(1 answer)
How do I force gcc to inline a function?
(8 answers)
C, inline function and GCC [duplicate]
(4 answers)
In C, should inline functions in headers be externed in the .c file?
(2 answers)
Closed 7 days ago.
There maybe a very simple solution to this problem but it has been bothering me for a while, so I have to ask.
In our embedded projects, it seems common to have simple get/set functions to many variables in separate C-files. Then, those variables are being called from many other C-files. When I look the assembly listing, those function calls are never replaced with move instructions. Faster way would be to just declare monitored variables as global variables to avoid unnecessary function calls.
Let's say you have a file.c which has variables that need to be monitored in another C-file main.c. For example, debugging variables, hardware registers, adc-values, etc. Is there a compiler optimization that replaces simple get/set functions with assembly move instructions thus avoiding unnecessary overhead caused by function calls?
file.h
#ifndef FILE_H
#define FILE_H
#include <stdint.h>
int32_t get_signal(void);
void set_signal(int32_t x);
#endif
file.c
#include "file.h"
#include <stdint.h>
static volatile int32_t *signal = SOME_HARDWARE_ADDRESS;
int32_t get_signal(void)
{
return *signal;
}
void set_signal(int32_t x)
{
*signal = x;
}
main.c
#include "file.h"
#include <stdio.h>
int main(int argc, char *args[])
{
// Do something with the variable
for (int i = 0; i < 10; i++)
{
printf("signal = %d\n", get_signal());
}
return 0;
}
If I compile the above code with gcc -Wall -save-temps main.c file.c -o main.exe, it gives the following assembly listing for main.c. You can always see the call get_signal even if you compile with -O3 flag which seems silly as we are only reading memory address. Why bother calling such simple function?
Same explanation applies for the simple set function. It is always called even though we would be only writing to one memory location in the function and doing nothing else.
main.s
main:
pushq %rbp
.seh_pushreg %rbp
movq %rsp, %rbp
.seh_setframe %rbp, 0
subq $48, %rsp
.seh_stackalloc 48
.seh_endprologue
movl %ecx, 16(%rbp)
movq %rdx, 24(%rbp)
call __main
movl $0, -4(%rbp)
jmp .L4
.L5:
call get_signal
movl %eax, %edx
leaq .LC0(%rip), %rcx
call printf
addl $1, -4(%rbp)
.L4:
cmpl $9, -4(%rbp)
jle .L5
movl $0, %eax
addq $48, %rsp
popq %rbp
ret
UPDATED 2023-02-13
Question was closed with several links to inline and Link-time Optimization-related answers. I don't think the same question has been answered before or at least the solution is not obvious for my get_function. What is there to inline if a function just returns a value and does nothing else?
Anyways, it seems, as suggested, that one solution to fix this problem is to add compiler flags -O2 -flto which correctly replaces assembly instruction call get_signal with move instruction with the following partial output:
main:
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
call __main
movl tmp.0(%rip), %edx
movl $10, %eax
.p2align 4,,10
.p2align 3
.L4:
movl signal(%rip), %ecx
addl %ecx, %edx
subl $1, %eax
jne .L4
leaq .LC0(%rip), %rcx
movl %edx, tmp.0(%rip)
call printf.constprop.0
xorl %eax, %eax
addq $40, %rsp
ret
.seh_endproc
Thank you.
Related
This question already has answers here:
Why does the compiler reserve a little stack space but not the whole array size?
(2 answers)
Stack allocation, padding, and alignment
(6 answers)
Closed 2 years ago.
How does gcc decides how much memory allocate for stack and why does it not decrement %rsp anymore when I remove printf() (or any function call) from my main?
1. I noticed when I played around with a code sample: https://godbolt.org/z/fQqkNE that the 6th line in gcc assembly viewer subq $48, %rsp gets removed if I remove printf() from my C code on line 22. It looks like when I don’t make any function calls from within my main, then the %rsp does not get decremented, but data still gets allocated based on %rbp and offsets. I thought %rsp changes only when stack grows. My theory is that since it won’t make any other function calls, it knows that it won’t need to keep stack for other nonexistent functions. But shouldn’t %rsp still grow as data is getting saved?
2. When adding variables to my rect struct, I also noticed that it sometimes allocates memory in steps greater than what the added data type size was. What is the convention it follows when deciding how much memory to allocate to stack?
3. Is there an online tool that would take assembly code as input, and then draw an image of stack and tell me state of every register at any point of execution? Godbolt.org is a very good tool, I just wish it had these 2 extra features.
I'll paste the code below in case the link to godbolt stops working in the future:
#include <stdio.h>
#include <stdint.h>
struct rect {
int a;
int b;
int* c;
int d[2];
uint8_t f;
};
int main() {
int arr[2] = {2, 3};
struct rect Rect;
Rect.a = 10;
Rect.b = 20;
Rect.c = arr;
Rect.d[0] = Rect.a;
Rect.d[1] = Rect.b;
Rect.f =255;
printf("%d and %d", Rect.a, Rect.b);
return 0;
}
.LC0:
.string "%d and %d"
main:
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp
movl $2, -8(%rbp)
movl $3, -4(%rbp)
movl $10, -48(%rbp)
movl $20, -44(%rbp)
leaq -8(%rbp), %rax
movq %rax, -40(%rbp)
movl -48(%rbp), %eax
movl %eax, -32(%rbp)
movl -44(%rbp), %eax
movl %eax, -28(%rbp)
movb $-1, -24(%rbp)
movl -44(%rbp), %edx
movl -48(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
P.S.: The book I follow uses AT&T syntax for teaching x86. Which is weird because it makes finding online tutorials much harder.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I know that 'static' is about scope, but I've got a question: what function/variable will be faster to access: a 'static' one or not?
Which code will be faster:
#include <stdio.h>
int main(){
int count;
for (count=0;count<1000;++count)
printf("%d\n",count);
return 0;
}
or
#include <stdio.h>
int main(){
static int count;
for (count=0;count<1000;++count)
printf("%d\n",count);
return 0;
}
In my code I'm working with VERY big numbers (with unsigned long long) and I'm accessing and increasing them about 4.000.000 times a second. This code is not the one I'm working on, it's just an example.
As a sign of good will, I have made up a program that we can actually reason about.
#include <stdint.h>
#include <stdio.h>
int
main()
{
static const uint64_t a = 1664525UL;
static const uint64_t c = 1013904223UL;
static const uint64_t m = (1UL << 31);
static uint32_t x = 1;
register unsigned i;
for (i = 0; i < 1000000000U; ++i)
x = (a * x + c) % m;
printf("%d\n", x);
return 0;
}
It will simply compute the one billionth element of a pseudo random sequence returned by a simple linear congruential generator. We have to do something more difficult than simply increment a counter or the compiler will optimize the entire loop out of existence.
Here is how I have compiled (GCC 4.9.1 on x86_64 GNU/Linux):
$ gcc -o non-static -Dstatic= -Wall -O3 main.c
$ gcc -o static -Wall -O3 main.c
To get the version without static, we simply #define it away on the compiler command line.
Running both programs took 2.36 seconds meaning there is no measurable performance difference.
To find out why, I like to look at the assembly code.
$ gcc -S -o non-static.s -Dstatic= -Wall -O3 main.c
$ gcc -S -o static.s -Wall -O3 main.c
We find that GCC generated identical machine code for the inner loop and moved the special treatment for the static variables out of the loop, which is what we should have expected from a good compiler.
Relevant code with static:
main:
.LFB11:
.cfi_startproc
movl x.2266(%rip), %esi
movl $1000000000, %eax
.p2align 4,,10
.p2align 3
.L2: # BEGIN LOOP
imull $1664525, %esi, %esi
addl $1013904223, %esi
andl $2147483647, %esi
subl $1, %eax
jne .L2 # END LOOP
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi
xorl %eax, %eax
movl %esi, x.2266(%rip)
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
and without:
main:
.LFB11:
.cfi_startproc
movl $1000000000, %eax
movl $1, %esi
.p2align 4,,10
.p2align 3
.L2: # BEGIN LOOP
imull $1664525, %esi, %esi
addl $1013904223, %esi
andl $2147483647, %esi
subl $1, %eax
jne .L2 # END LOOP
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $.LC0, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
This just re-emphasizes what many have tried to express in their comments: We need actual code to reason about performance and we should really benchmark it while doing so.
Also, you shouldn't worry too much about such things and trust your compiler most of the time. Focus on writing readable and maintainable code and only fiddle with the dirty details if you have evidence that it is necessary to achieve the required performance. In your particular example, I cannot see any valid reason to declare the local variables static. It disturbs me as a reader and should not be done.
I am doing some extended assembly optimization on gnu C code running on 64 bit linux. I wanted to print debugging messages from within the assembly code and that's how I came accross the following. I am hoping someone can explain what I am supposed to do in this situation.
Take a look at this sample function:
void test(int a, int b, int c, int d){
__asm__ volatile (
"movq $0, %%rax\n\t"
"pushq %%rax\n\t"
"popq %%rax\n\t"
:
:"m" (a)
:"cc", "%rax"
);
}
Since the four agruments to the function are of class INTEGER, they will be passed through registers and then pushed onto the stack. The strange thing to me is how gcc actually does it:
test:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl %edx, -12(%rbp)
movl %ecx, -16(%rbp)
movq $0, %rax
pushq %rax
popq %rax
popq %rbp
ret
The passed arguments are pushed onto the stack, but the stack pointer is not decremented. Thus, when I do pushq %rax, the values of a and b are overwritten.
What I am wondering: is there a way to ask gcc to properly set up the local stack? Am I simply not supposed to use push and pop in function calls?
x86-64 abi provides a 128 byte red zone under the stack pointer, and the compiler decided to use that. You can turn that off using -mno-red-zone option.
My c code:
#include <stdio.h>
foo()
{
char buffer[8];
}
main()
{
foo();
return 0;
}
I compile it using gcc -ggdb -mpreferred-stack-boundary=2 -o bar bar.c
When I load it using GDB ./bar I see that inside the foo function the code is:
sub $0x0c,$esp
Why is this happening?
I want to buffer to take 8 bytes in the stack so it should be sub $0x8,$esp!
Why can't I set stack boundary to 4 bytes?
Help!
I can't reproduce exactly what you are seeing, but on my 4.8.2 version of gcc, the option does affect the amount of stack used with this code (make sure "buffer" is used to avoid it being optimised away, and fix the warnings for no return type/argument types):
#include <stdio.h>
void foo(void)
{
char buffer[8];
buffer[0] = 'a';
buffer[1] = '\n';
buffer[2] = 0;
printf("my first program! %s\n", buffer);
}
int main()
{
foo();
return 0;
}
Compiled with -mpreferred-stack-boundary=2 and -mpreferred-stack-boundary=4, and the difference between the generated assembler is notable:
$ diff -u stb-2.s stb-4.s
--- stb-2.s 2014-04-10 09:00:39.546038191 +0100
+++ stb-4.s 2014-04-10 09:00:58.895108979 +0100
## -15,11 +15,11 ##
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
- subl $16, %esp
- movb $97, -8(%ebp)
- movb $10, -7(%ebp)
- movb $0, -6(%ebp)
- leal -8(%ebp), %eax
+ subl $40, %esp
+ movb $97, -16(%ebp)
+ movb $10, -15(%ebp)
+ movb $0, -14(%ebp)
+ leal -16(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
.LEHB0:
## -67,9 +67,10 ##
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
+ andl $-16, %esp
call _Z3foov
movl $0, %eax
- popl %ebp
+ leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
So, at least in gcc 4.8.2. for x86-32, the option has an effect.
Of course, the default according to the docs is -mpreferred-stack-boundary=2, so maybe that's why you can't see any difference from "without" (Although in my experiments, it seems that it's -mpreferred-stack-boundary=4). [Moment passes] Ah, the default has been changed over time, so the 4.4.2 docs online says 2, my info gcc for 4.8.2 says 4, which explains the difference.
As to why your code is allocating twelve bytes of stack-space - look at how printf is called:
movl $.LC0, (%esp)
call printf
If the compiler can, it will pre-allocate argument space for function calls at the start of the function, rather than use push $.LC0 as it would be in this case. It's not much difference, but it saves at least one instruction for cleanup at the other side of printf (and it makes it MUCH easier to deal with stack-relative offsets within the produced code, since the compiler doesn't have to keep track of where the current stack-pointer is - it's always at a constant place after the prologue code at the beginning of the function, all the way to the end of the function). Since the space is ultimately required anyway, there's no point in "saving 4 bytes".
I'm studying NASM on Linux 64-bit and have been trying to implement some examples of code. However I got a problem in the following example. The function donothing is implemented in NASM and is supposed to be called in a program implemented in C:
File main.c:
#include <stdio.h>
#include <stdlib.h>
int donothing(int, int);
int main() {
printf(" == %d\n", donothing(1, 2));
return 0;
}
File first.asm
global donothing
section .text
donothing:
push rbp
mov rbp, rsp
mov eax, [rbp-0x4]
pop rbp
ret
What donothing does is nothing more than returning the value of the first parameter. But when donothing is called the value 0 is printed instead of 1. I tried rbp+0x4, but it doesn't work too.
I compile the files using the following command:
nasm -f elf64 first.asm && gcc first.o main.c
Compiling the function 'test' in C by using gcc -s the assembly code generated to get the parameters looks similar to the donothing:
int test(int a, int b) {
return a > b;
}
Assembly generated by gcc for the function 'test' above:
test:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
cmpl -8(%rbp), %eax
setg %al
movzbl %al, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
So, what's wrong with donothing?
In x86-64 calling conventions the first few parameters are passed in registers rather than on the stack. In your case you should find the 1 and 2 in RDI and RSI.
As you can see in the compiled C code, it takes a from edi and b from esi (although it goes through an unnecessary intermediate step by placing them in memory)