Why does the compiler allocate more than needed in the stack?

I have a simple C program. Let's say, for example, I have an int and a char array of length 20. I need 24 bytes in total.
int main()
{
char buffer[20];
int x = 0;
buffer[0] = 'a';
buffer[19] = 'a';
}
The stack needs to be aligned to a 16-byte boundary, so I presumed the compiler would reserve 32 bytes. But when I compile this program with gcc for x86-64 and read the output assembly, the compiler reserves 64 bytes.
..\gcc -S -o main.s main.c
Gives me:
.file "main.c"
.def __main; .scl 2; .type 32; .endef
.text
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
pushq %rbp # RBP is pushed, so no need to reserve more for it
.seh_pushreg %rbp
movq %rsp, %rbp
.seh_setframe %rbp, 0
subq $64, %rsp # Reserving the 64 bytes
.seh_stackalloc 64
.seh_endprologue
call __main
movl $0, -4(%rbp) # Using the first 4 bytes to store the int
movb $97, -32(%rbp) # Using from RBP-32
movb $97, -13(%rbp) # to RBP-13 to store the char array
movl $0, %eax
addq $64, %rsp # Restoring the stack with the last 32 bytes unused
popq %rbp
ret
.seh_endproc
.ident "GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 5.2.0"
Why is that? When I program assembly, I always reserve only the minimum memory I need without any problem. Is that a limitation of the compiler which has trouble evaluating the needed memory or is there a reason for that?
Here is gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=D:/Mingw64/bin/../libexec/gcc/x86_64-w64-mingw32/5.2.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with: ../../../src/gcc-5.2.0/configure --host=x86_64-w64-mingw32 --build=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --prefix=/mingw64 --with-sysroot=/c/mingw520/x86_64-520-posix-seh-rt_v4-rev0/mingw64 --with-gxx-include-dir=/mingw64/x86_64-w64-mingw32/include/c++ --enable-shared --enable-static --disable-multilib --enable-languages=c,c++,fortran,objc,obj-c++,lto --enable-libstdcxx-time=yes --enable-threads=posix --enable-libgomp --enable-libatomic --enable-lto --enable-graphite --enable-checking=release --enable-fully-dynamic-string --enable-version-specific-runtime-libs --disable-isl-version-check --disable-libstdcxx-pch --disable-libstdcxx-debug --enable-bootstrap --disable-rpath --disable-win32-registry --disable-nls --disable-werror --disable-symvers --with-gnu-as --with-gnu-ld --with-arch=nocona --with-tune=core2 --with-libiconv --with-system-zlib --with-gmp=/c/mingw520/prerequisites/x86_64-w64-mingw32-static --with-mpfr=/c/mingw520/prerequisites/x86_64-w64-mingw32-static --with-mpc=/c/mingw520/prerequisites/x86_64-w64-mingw32-static --with-isl=/c/mingw520/prerequisites/x86_64-w64-mingw32-static --with-pkgversion='x86_64-posix-seh-rev0, Built by MinGW-W64 project' --with-bugurl=http://sourceforge.net/projects/mingw-w64 CFLAGS='-O2 -pipe -I/c/mingw520/x86_64-520-posix-seh-rt_v4-rev0/mingw64/opt/include -I/c/mingw520/prerequisites/x86_64-zlib-static/include -I/c/mingw520/prerequisites/x86_64-w64-mingw32-static/include' CXXFLAGS='-O2 -pipe -I/c/mingw520/x86_64-520-posix-seh-rt_v4-rev0/mingw64/opt/include -I/c/mingw520/prerequisites/x86_64-zlib-static/include -I/c/mingw520/prerequisites/x86_64-w64-mingw32-static/include' CPPFLAGS= LDFLAGS='-pipe -L/c/mingw520/x86_64-520-posix-seh-rt_v4-rev0/mingw64/opt/lib -L/c/mingw520/prerequisites/x86_64-zlib-static/lib -L/c/mingw520/prerequisites/x86_64-w64-mingw32-static/lib '
Thread model: posix
gcc version 5.2.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project)

Compilers may indeed reserve additional memory for themselves.
GCC has a flag, -mpreferred-stack-boundary, that sets the stack alignment it maintains. The value is an exponent: according to the documentation, the default is 4, which gives 2^4 = 16-byte alignment, as needed for SSE instructions.
As VermillionAzure noted in a comment, you should provide your gcc version and compile-time options (use gcc -v to show these).
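To make the alignment arithmetic concrete, here is a minimal sketch of rounding a frame size up to a power-of-two boundary. The helper names align_up and frame_size are hypothetical, and the 32-byte shadow-space figure is an assumption based on the Windows x64 calling convention that the x86_64-w64-mingw32 target follows. It is one plausible accounting for the 64 bytes, not a statement of what GCC computes internally.

```c
#include <stddef.h>

/* Round n up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical accounting for the frame in the question:
 * locals rounded up to 16 bytes, plus (assumption) the shadow space
 * that the Windows x64 ABI has a caller reserve for its callee. */
size_t frame_size(size_t locals, size_t shadow_space)
{
    return align_up(locals, 16) + align_up(shadow_space, 16);
}
```

Under these assumptions, align_up(24, 16) gives 32 and frame_size(24, 32) gives 64, which matches the subq $64, %rsp in the listing. The offsets in the question are consistent with this: the locals occupy -32(%rbp) through -1(%rbp), and the lower 32 bytes would be the area reserved for call __main.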

Because you haven't enabled optimization.
Without optimization, the compiler makes no attempt to minimize the amount of space or time it needs for anything in the generated code -- it just generates code in the most straightforward way possible.
Add -O2 (or even just -O1) or -Os if you want the compiler to produce decent code.

I need 24 bytes in total.
The compiler needs space for a return address and a base pointer. As you are in 64 bit mode, that's another 16 bytes. Total 40. Round that up to a 32-byte boundary and you get 64.

Related

How do I stop GCC from optimizing this byte-for-byte copy into a memcpy call?

I have this code for memcpy as part of my implementation of the standard C library which copies memory from src to dest one byte at a time:
void *memcpy(void *restrict dest, const void *restrict src, size_t len)
{
char *dp = (char *restrict)dest;
const char *sp = (const char *restrict)src;
while( len-- )
{
*dp++ = *sp++;
}
return dest;
}
With gcc -O2, the code generated is reasonable:
memcpy:
.LFB0:
movq %rdi, %rax
testq %rdx, %rdx
je .L2
xorl %ecx, %ecx
.L3:
movzbl (%rsi,%rcx), %r8d
movb %r8b, (%rax,%rcx)
addq $1, %rcx
cmpq %rdx, %rcx
jne .L3
.L2:
ret
.LFE0:
However, at gcc -O3, GCC optimizes this naive byte-for-byte copy into a memcpy call:
memcpy:
.LFB0:
testq %rdx, %rdx
je .L7
subq $8, %rsp
call memcpy
addq $8, %rsp
ret
.L7:
movq %rdi, %rax
ret
.LFE0:
This won't work (memcpy unconditionally calls itself), and it causes a segfault.
I've tried passing -fno-builtin-memcpy and -fno-loop-optimizations, and the same thing occurs.
I'm using GCC version 8.3.0:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-cros-linux-gnu/8.3.0/lto-wrapper
Target: x86_64-cros-linux-gnu
Configured with: ../configure --prefix=/usr/local --libdir=/usr/local/lib64 --build=x86_64-cros-linux-gnu --host=x86_64-cros-linux-gnu --target=x86_64-cros-linux-gnu --enable-checking=release --disable-multilib --enable-threads=posix --disable-bootstrap --disable-werror --disable-libmpx --enable-static --enable-shared --program-suffix=-8.3.0 --with-arch-64=x86-64
Thread model: posix
gcc version 8.3.0 (GCC)
How do I disable the optimization that causes the copy to be transformed into a memcpy call?

One thing that seems to be sufficient here: instead of using -fno-builtin-memcpy use -fno-builtin for compiling the translation unit of memcpy alone!
An alternative would be to pass -fno-tree-loop-distribute-patterns, though this might be brittle, as it forbids the compiler from first reorganizing the loop code and then replacing parts of it with calls to mem* functions.
Or, since you cannot rely on anything in the C library, perhaps using -ffreestanding could be in order.
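As a sketch of the -fno-tree-loop-distribute-patterns suggestion applied to a single function rather than the whole translation unit, GCC's optimize function attribute can carry the same option. The macro name NO_LOOP_TO_LIBCALL and the function name byte_copy are hypothetical. The attribute is GCC-specific, so the macro expands to nothing on other compilers, and whether it suppresses the transformation on a given GCC version is worth confirming by inspecting the -S output.

```c
#include <stddef.h>

/* GCC-only: ask the optimizer not to turn loops in this function into
 * calls to memcpy/memset. Expands to nothing elsewhere (assumption: other
 * compilers need their own mechanism, such as -fno-builtin). */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
    __attribute__((optimize("no-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

/* Byte-for-byte copy, same loop as in the question, under a different
 * name so it does not collide with the library's memcpy. */
NO_LOOP_TO_LIBCALL
void *byte_copy(void *restrict dest, const void *restrict src, size_t len)
{
    char *dp = dest;
    const char *sp = src;

    while (len--) {
        *dp++ = *sp++;
    }
    return dest;
}
```

Keeping the option on the one function leaves the rest of the translation unit free to use the loop-to-libcall transformation where it is harmless.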

This won't work (memcpy unconditionally calls itself), and it causes a segfault.
Redefining memcpy is undefined behavior.
How do I disable the optimization that causes the copy to be transformed into a memcpy call (preferably while still compiling with -O3)?
Don't. The best approach is fixing your code instead:
In most cases, you should use another name.
In the rare case you are really implementing a C library (as discussed in the comments), and you really want to reimplement memcpy, then you should be using compiler-specific options to achieve that. For GCC, see -fno-builtin* and -ffreestanding, as well as -nodefaultlibs and -nostdlib.

How can I change objdump output format?

Hi, I'm learning how C compilers work with this book: https://www.sigbus.info/compilerbook
I want to get the same result as the book shows. What should I do? I think I need to change the version of gcc or objdump, or the options I pass.
The book says that the following expected assembly output can itself be compiled.
expected
.intel_syntax noprefix
.global main
main:
mov rax, 42
ret
actual
00000000000005fa <main>:
5fa: 55 push rbp
5fb: 48 89 e5 mov rbp,rsp
5fe: b8 2a 00 00 00 mov eax,0x2a
603: 5d pop rbp
604: c3 ret
605: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
60c: 00 00 00
60f: 90 nop
what I did
root@686394c78009:/zcc# uname -a
Linux 686394c78009 4.9.125-linuxkit #1 SMP Fri Sep 7 08:20:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
root@686394c78009:/zcc# objdump -v
GNU objdump (GNU Binutils for Ubuntu) 2.30
Copyright (C) 2018 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) any later version.
This program has absolutely no warranty.
root@686394c78009:/zcc# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
root@686394c78009:/zcc# cat test1.c
int main() {
return 42;
}
root@686394c78009:/zcc# gcc -o test1 test1.c
root@686394c78009:/zcc# ./test1
root@686394c78009:/zcc# echo $?
42
root@686394c78009:/zcc# objdump -d -M intel ./test1
Update 1
I generated assembly code with the -S option, and compiling from the generated assembly worked.
There are still some differences from my reference book, but I will learn more.
Another curious thing is that a different register name is used (eax instead of rax). I will look into that too. (I have realized I need to learn the basics.)
// expected
mov rax, 42
// actual
mov eax, 42
root@686394c78009:/zcc# gcc -S -masm=intel test1.c
root@686394c78009:/zcc# cat test1.s
.file "test1.c"
.intel_syntax noprefix
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
mov eax, 42
pop rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
.section .note.GNU-stack,"",@progbits
root@686394c78009:/zcc# gcc -o test1 test1.s
root@686394c78009:/zcc# ./test1
root@686394c78009:/zcc# echo $?
42

Instead of dumping with objdump, try to directly generate assembly code with the -S option for the compiler. With -masm=intel, the output should look similar to what you expect.
Do not expect the compiler to generate the exact same code though. Different compilers and different compiler versions or even the same compiler with different flags may make different choices and generate different assembly for the same code. That's normal.

thread local storage in assembly

I want to increment a TLS variable in assembly, but it gives a segmentation fault in the assembly code. I don't want to let the compiler change any other register or memory. Is there a way to do this without using gcc's input and output operand syntax?
__thread unsigned val;
int main() {
val = 0;
asm("incl %gs:val");
return 0;
}

If you really really need to be able to do this for some reason, you should access a thread-local variable from assembly language by preloading its address in C, like this:
__thread unsigned val;
void incval(void)
{
unsigned *vp = &val;
asm ("incl\t%0" : "+m" (*vp));
}
This is because the code sequence required to access a thread-local variable is different for just about every OS and CPU combination supported by GCC, and also varies if you're compiling for a shared library rather than an executable (i.e. with -fPIC). The above construct allows the compiler to emit the correct code sequence for you. In cases where it is possible to access the thread-local variable without any extra instructions, the address generation will be folded into the assembly operation. By way of illustration, here is how gcc 4.7 for x86/Linux compiles the above in several different possible modes (I've stripped out a bunch of assembler directives in all cases, for clarity)...
# -S -O2 -m32 -fomit-frame-pointer
incval:
incl %gs:val@ntpoff
ret
# -S -O2 -m64
incval:
incl %fs:val@tpoff
ret
# -S -O2 -m32 -fomit-frame-pointer -fpic
incval:
pushl %ebx
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
leal val@tlsgd(,%ebx,1), %eax
call ___tls_get_addr@PLT
incl (%eax)
popl %ebx
ret
# -S -O2 -m64 -fpic
incval:
.byte 0x66
leaq val@tlsgd(%rip), %rdi
.value 0x6666
rex64
call __tls_get_addr@PLT
incl (%rax)
ret
Do realize that all four examples would be different if I'd compiled for x86/OSX, and different yet again for x86/Windows.
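A small usage sketch of the pattern above: the address of the thread-local is taken in C and handed to the inline assembly as a memory operand. The fallback branch for non-x86 targets and the helper readval are additions for illustration, not part of the original answer.

```c
/* Thread-local counter, as in the question (GCC/Clang __thread extension). */
__thread unsigned val;

/* Increment val; the compiler emits whatever TLS access sequence
 * the target needs in order to form the address. */
void incval(void)
{
    unsigned *vp = &val;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("incl\t%0" : "+m" (*vp));
#else
    ++*vp;  /* assumption: plain C fallback where the x86 instruction does not apply */
#endif
}

/* Hypothetical helper: read the calling thread's copy of val. */
unsigned readval(void)
{
    return val;
}
```

Because the operand is declared "+m", the compiler knows the memory is both read and written, so no other register or memory needs to be named as clobbered.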

Unable to printf floating point numbers from executable shared library

I'm developing a shared library which can be executed independently to print its own version number.
I've defined a custom entry point as:
#include <stdio.h>
#include <unistd.h>
const char my_interp[] __attribute__((section(".interp"))) = "/lib64/ld-linux-x86-64.so.2";
void my_main() {
printf("VERSION: %d\n", 0);
_exit(0);
}
and I compile with
gcc -o list.os -c -g -Wall -fPIC list.c
gcc -o liblist.so -g -Wl,-e,my_main -shared list.os -lc
This code compiles and runs perfectly.
My issue is when I change the parameter of the printf to be a float or double (%f or %lf). The library will then compile but segfault when run.
Anyone have any ideas?
edit1:
Here is the code that segfaults:
#include <stdio.h>
#include <unistd.h>
const char my_interp[] __attribute__((section(".interp"))) = "/lib64/ld-linux-x86-64.so.2";
void my_main() {
printf("VERSION: %f\n", 0.1f);
_exit(0);
}
edit2:
Additional environmental details:
uname -a
Linux mjolnir.site 3.1.10-1.16-desktop #1 SMP PREEMPT Wed Jun 27 05:21:40 UTC 2012 (d016078) x86_64 x86_64 x86_64 GNU/Linux
gcc --version
gcc (SUSE Linux) 4.6.2
/lib64/libc.so.6
Configured for x86_64-suse-linux.
Compiled by GNU CC version 4.6.2.
Compiled on a Linux 3.1.0 system on 2012-03-30.
edit 3:
Output in /var/log/messages upon segfault:
Aug 11 08:27:45 mjolnir kernel: [10560.068741] liblist.so[11222] general protection ip:7fc2b3cb2314 sp:7fff4f5c7de8 error:0 in libc-2.14.1.so[7fc2b3c63000+187000]

Figured it out. :)
Floating-point operations on x86_64 use the xmm vector registers, and aligned loads and stores of these registers require the memory operand (here, the stack) to be aligned on a 16-byte boundary. This explains why 32-bit platforms were unaffected and why integer and character printing worked.
I've compiled my code to assembly with:
gcc -W list.c -o list.S -shared -Wl,-e,my_main -S -fPIC
then altered the "my_main" function to have more stack space.
Before:
my_main:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $.LC0, %eax
movsd .LC1(%rip), %xmm0
movq %rax, %rdi
movl $1, %eax
call printf
movl $0, %edi
call _exit
.cfi_endproc
After:
my_main:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
subq $8, %rsp # ADDED THIS LINE
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $.LC0, %eax
movsd .LC1(%rip), %xmm0
movq %rax, %rdi
movl $1, %eax
call printf
movl $0, %edi
call _exit
.cfi_endproc
Then I compiled this .S file by:
gcc list.S -o liblist.so -Wl,-e,my_main -shared
This fixes the issue, but I will forward this thread to the GCC and GLIBC mailing lists, as it looks like a bug.
edit1:
According to noshadow on the GCC IRC channel, this is a nonstandard way to do this. He said that if one uses the -e option with gcc, one should either initialize the C runtime manually or not use libc functions. Makes sense.
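The 16-byte requirement mentioned above can be checked directly from C. This is a minimal sketch: the helper name is_aligned is hypothetical, and it only inspects an address, it does not by itself fix a misaligned entry point.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if p is aligned to `align` bytes (align must be a power of two). */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

This fits the observed fix: a function reached through an ordinary call sees a stack pointer that is 8 bytes past a 16-byte boundary (the return address was just pushed), so pushq %rbp brings it back into alignment. A function entered directly via -e has no return address on the stack, so after pushq %rbp the stack pointer is off by 8, and the added subq $8, %rsp compensates before the call to printf.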
