How could I define a function at run-time in C?

I am trying to define and call a function at run time in C on an ARM CPU (Cortex-A72). To do that, I wrote the code below:
#include <stdio.h>
#include <sys/mman.h>
char* ibuf;
int pbuf = 0;
#define ADD_BYTE(val) do{ibuf[pbuf] = val; pbuf++;} while(0)
void (*routine)(void);
void MakeRoutineSimpleFunc(void)
{
    //nop
    ADD_BYTE(0x00);
    ADD_BYTE(0xf0);
    ADD_BYTE(0x20);
    ADD_BYTE(0xe3);
    //bx lr
    ADD_BYTE(0x1e);
    ADD_BYTE(0xff);
    ADD_BYTE(0x2f);
    ADD_BYTE(0xe1);
}
int main(void)
{
    ibuf = (char*)mmap(NULL, 4 * 1024, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_POPULATE | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    MakeRoutineSimpleFunc();
    routine = (void(*)())(ibuf);
    routine();
}
As you can see, the code above first allocates an executable memory region and assigns its address to ibuf. It then writes two simple instructions into ibuf (a "nop" and a "bx lr", which is a return on ARM) and finally tries to call this function through a function pointer.
But when I call the function through the function pointer, I get a "segmentation fault" error. BTW, when I run the app under the GDB debugger, the program runs successfully without any error.
Is there anything I missed in the code above that causes the "segmentation fault"?
I want to add that when I place the same instructions (a "nop" and a "bx lr") into a function at compile time, as below, the function works without any error.
void f2(void)
{
    __asm__ volatile (".byte 0x00, 0xf0, 0x20, 0xe3");
    __asm__ volatile (".byte 0x1e, 0xff, 0x2f, 0xe1");
}
EDIT1:
To check the validity of the run-time function, I removed the prologue and epilogue of f2 with the Ghidra disassembler, so the assembly code of f2 looks like this:
**************************************************************
FUNCTION
**************************************************************
undefined FUN_0000083c()
undefined r0:1 <RETURN>
undefined4 Stack[-0x4]:4 local_4
FUN_0000083c XREF[1]: FUN_00000868:000008a4(c)
0000083c 00 f0 20 e3 nop
00000840 00 f0 20 e3 nop
00000844 00 f0 20 e3 nop
00000848 00 f0 20 e3 nop
0000084c 00 f0 20 e3 nop
00000850 00 f0 20 e3 nop
00000854 1e ff 2f e1 bx lr
00000858 00 f0 20 e3 nop
0000085c 00 f0 20 e3 nop
00000860 00 f0 20 e3 nop
00000864 00 f0 20 e3 nop
And it again works without a problem.
EDIT2:
Something I want to add that may help solve the problem: as I saw in the disassembly, the compiler calls the "routine" function with a "blx r3" instruction, while it calls other functions with "bl 'symbol name'". As far as I know, blx can change the processor state from ARM to Thumb or vice versa. Could this cause the problem?
EDIT3:
The disassembly of the main function looks like this:
**************************************************************
FUNCTION
**************************************************************
int __stdcall main(void)
int r0:4 <RETURN>
undefined4 Stack[-0xc]:4 local_c XREF[1]: 00010d44(W)
undefined4 Stack[-0x10]:4 local_10 XREF[1]: 00010d4c(W)
main XREF[4]: Entry Point(*),
_start:00010394(*), 000103a8(*),
.debug_frame::000000a0(*)
00010d34 00 48 2d e9 stmdb sp!,{ r11 lr }
00010d38 04 b0 8d e2 add r11,sp,#0x4
00010d3c 08 d0 4d e2 sub sp,sp,#0x8
00010d40 00 30 a0 e3 mov r3,#0x0
00010d44 04 30 8d e5 str r3,[sp,#local_c]
00010d48 00 30 e0 e3 mvn r3,#0x0
00010d4c 00 30 8d e5 str r3,[sp,#0x0]=>local_10
00010d50 22 30 a0 e3 mov r3,#0x22
00010d54 07 20 a0 e3 mov r2,#0x7
00010d58 01 1a a0 e3 mov r1,#0x1000
00010d5c 00 00 a0 e3 mov r0,#0x0
00010d60 7d fd ff eb bl mmap
00010d64 00 20 a0 e1 cpy r2,r0
00010d68 50 30 9f e5 ldr r3,[->ibuf]
00010d6c 00 20 83 e5 str r2,[r3,#0x0]=>ibuf
00010d70 48 30 9f e5 ldr r3,[->ibuf]
00010d74 00 30 93 e5 ldr r3,[r3,#0x0]=>ibuf
00010d78 03 10 a0 e1 cpy r1,r3
00010d7c 40 00 9f e5 ldr r0=>s_ibuf:_%x_00010e40,[PTR_s_ibuf:_%x_00010d
00010d80 69 fd ff eb bl printf
00010d84 ae fe ff eb bl MakeRoutineSimpleFunc
00010d88 30 30 9f e5 ldr r3,[->ibuf]
00010d8c 00 30 93 e5 ldr r3,[r3,#0x0]=>ibuf
00010d90 03 20 a0 e1 cpy r2,r3
00010d94 2c 30 9f e5 ldr r3,[->routine]
00010d98 00 20 83 e5 str r2,[r3,#0x0]=>routine
00010d9c 24 30 9f e5 ldr r3,[->routine]
00010da0 00 30 93 e5 ldr r3,[r3,#0x0]=>routine
00010da4 33 ff 2f e1 blx r3
00010da8 1c 00 9f e5 ldr r0=>DAT_00010e4c,
00010dac 61 fd ff eb bl puts
00010db0 00 30 a0 e3 mov r3,#0x0
00010db4 03 00 a0 e1 cpy r0,r3
00010db8 04 d0 4b e2 sub sp,r11,#0x4
00010dbc 00 88 bd e8 ldmia sp!,{ r11 pc }
As you can see, routine is called with the "blx r3" instruction at address "00010da4". I also printed the address of ibuf; it was "0xb6ff8000".

I think you can put the opcodes directly in a string of binary code and execute it with ((void (*)(void))STRING)(), provided the memory that holds the string is executable. You may also want to read about how gcc implements trampolines, because that is how gcc generates code that builds code on the stack and transfers execution there.
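One possibility that this thread does not confirm, but that commonly affects generated code on ARM, is that the instruction cache is not kept coherent with data writes, so freshly written opcodes may not be visible to instruction fetch until the caches are synchronized. The sketch below assumes GCC or Clang (for __builtin___clear_cache) and POSIX mmap/mprotect. The name run_generated_routine and the non-ARM byte sequences are illustrative additions; only the 32-bit ARM bytes come from the question.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Machine code for "nop; return". The ARM bytes are the ones from the
 * question; the other encodings are assumptions added so the sketch can
 * also run on other machines. */
#if defined(__arm__)
static const unsigned char code[] = { 0x00, 0xf0, 0x20, 0xe3,   /* nop   */
                                      0x1e, 0xff, 0x2f, 0xe1 }; /* bx lr */
#elif defined(__aarch64__)
static const unsigned char code[] = { 0x1f, 0x20, 0x03, 0xd5,   /* nop */
                                      0xc0, 0x03, 0x5f, 0xd6 }; /* ret */
#elif defined(__x86_64__) || defined(__i386__)
static const unsigned char code[] = { 0x90, 0xc3 };             /* nop; ret */
#else
#define NO_KNOWN_ENCODING 1
#endif

/* Returns 0 if the generated routine ran, 1 if this architecture has no
 * encoding in the table above, and -1 if a system call failed. */
int run_generated_routine(void)
{
#ifdef NO_KNOWN_ENCODING
    return 1;
#else
    size_t len = 4096;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memcpy(buf, code, sizeof code);

    /* Make the bytes just written visible to instruction fetch. */
    __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);

    /* Switch the page from writable to executable before calling into it. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return -1;
    }

    void (*routine)(void) = (void (*)(void))buf;
    routine();

    munmap(buf, len);
    return 0;
#endif
}
```

On the blx question: a register blx selects the instruction set from bit 0 of the target address, so an even address such as 0xb6ff8000 stays in ARM state. That the program runs under GDB would be consistent with a cache effect, but that is a guess rather than something shown here.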

Related

Why is this code acting different with a single printf? ucontext.h

When I compile the code below, it prints
I am running :)
forever (until I send a KeyboardInterrupt signal to the program),
but when I uncomment // printf("done:%d\n", done);, recompile and run it, it prints that line only twice, then prints done:1 and returns.
I'm new to ucontext.h and I'm very confused about how this code works and
why a single printf changes the whole behavior of the code. If you replace the printf with done++; it does the same, but if you replace it with done = 2; it has no effect and the program behaves as it did with the printf commented out in the first place.
Can anyone explain:
Why is this code acting like this and what's the logic behind it?
Sorry for my bad English,
Thanks a lot.
#include <ucontext.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
    register int done = 0;
    ucontext_t one;
    ucontext_t two;
    getcontext(&one);
    printf("I am running :)\n");
    sleep(1);
    if (!done)
    {
        done = 1;
        swapcontext(&two, &one);
    }
    // printf("done:%d\n", done);
    return 0;
}
This is a compiler optimization "problem". When the "printf()" is commented out, the compiler deduces that "done" will not be used after the "if (!done)", so it does not set it to 1 because the store is not worth doing. But when the "printf()" is present, "done" is used after "if (!done)", so the compiler sets it.
Assembly code with the "printf()":
$ gcc ctx.c -o ctx -g
$ objdump -S ctx
[...]
int main(void)
{
11e9: f3 0f 1e fa endbr64
11ed: 55 push %rbp
11ee: 48 89 e5 mov %rsp,%rbp
11f1: 48 81 ec b0 07 00 00 sub $0x7b0,%rsp
11f8: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
11ff: 00 00
1201: 48 89 45 f8 mov %rax,-0x8(%rbp)
1205: 31 c0 xor %eax,%eax
register int done = 0;
1207: c7 85 5c f8 ff ff 00 movl $0x0,-0x7a4(%rbp) <------- done set to 0
120e: 00 00 00
ucontext_t one;
ucontext_t two;
getcontext(&one);
1211: 48 8d 85 60 f8 ff ff lea -0x7a0(%rbp),%rax
1218: 48 89 c7 mov %rax,%rdi
121b: e8 c0 fe ff ff callq 10e0 <getcontext@plt>
1220: f3 0f 1e fa endbr64
printf("I am running :)\n");
1224: 48 8d 3d d9 0d 00 00 lea 0xdd9(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
122b: e8 70 fe ff ff callq 10a0 <puts@plt>
sleep(1);
1230: bf 01 00 00 00 mov $0x1,%edi
1235: e8 b6 fe ff ff callq 10f0 <sleep@plt>
if (!done)
123a: 83 bd 5c f8 ff ff 00 cmpl $0x0,-0x7a4(%rbp)
1241: 75 27 jne 126a <main+0x81>
{
done = 1;
1243: c7 85 5c f8 ff ff 01 movl $0x1,-0x7a4(%rbp) <----- done set to 1
124a: 00 00 00
swapcontext(&two, &one);
124d: 48 8d 95 60 f8 ff ff lea -0x7a0(%rbp),%rdx
1254: 48 8d 85 30 fc ff ff lea -0x3d0(%rbp),%rax
125b: 48 89 d6 mov %rdx,%rsi
125e: 48 89 c7 mov %rax,%rdi
1261: e8 6a fe ff ff callq 10d0 <swapcontext@plt>
1266: f3 0f 1e fa endbr64
}
printf("done:%d\n", done);
126a: 8b b5 5c f8 ff ff mov -0x7a4(%rbp),%esi
1270: 48 8d 3d 9d 0d 00 00 lea 0xd9d(%rip),%rdi # 2014 <_IO_stdin_used+0x14>
1277: b8 00 00 00 00 mov $0x0,%eax
127c: e8 3f fe ff ff callq 10c0 <printf@plt>
return 0;
Assembly code without the "printf()":
$ gcc ctx.c -o ctx -g
$ objdump -S ctx
[...]
int main(void)
{
11c9: f3 0f 1e fa endbr64
11cd: 55 push %rbp
11ce: 48 89 e5 mov %rsp,%rbp
11d1: 48 81 ec b0 07 00 00 sub $0x7b0,%rsp
11d8: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
11df: 00 00
11e1: 48 89 45 f8 mov %rax,-0x8(%rbp)
11e5: 31 c0 xor %eax,%eax
register int done = 0;
11e7: c7 85 5c f8 ff ff 00 movl $0x0,-0x7a4(%rbp) <------ done set to 0
11ee: 00 00 00
ucontext_t one;
ucontext_t two;
getcontext(&one);
11f1: 48 8d 85 60 f8 ff ff lea -0x7a0(%rbp),%rax
11f8: 48 89 c7 mov %rax,%rdi
11fb: e8 c0 fe ff ff callq 10c0 <getcontext@plt>
1200: f3 0f 1e fa endbr64
printf("I am running :)\n");
1204: 48 8d 3d f9 0d 00 00 lea 0xdf9(%rip),%rdi # 2004 <_IO_stdin_used+0x4>
120b: e8 80 fe ff ff callq 1090 <puts@plt>
sleep(1);
1210: bf 01 00 00 00 mov $0x1,%edi
1215: e8 b6 fe ff ff callq 10d0 <sleep@plt>
if (!done)
121a: 83 bd 5c f8 ff ff 00 cmpl $0x0,-0x7a4(%rbp)
1221: 75 1d jne 1240 <main+0x77>
{
done = 1; <------------- done is not set here (the store is optimized out by the compiler)
swapcontext(&two, &one);
1223: 48 8d 95 60 f8 ff ff lea -0x7a0(%rbp),%rdx
122a: 48 8d 85 30 fc ff ff lea -0x3d0(%rbp),%rax
1231: 48 89 d6 mov %rdx,%rsi
1234: 48 89 c7 mov %rax,%rdi
1237: e8 74 fe ff ff callq 10b0 <swapcontext@plt>
123c: f3 0f 1e fa endbr64
}
//printf("done:%d\n", done);
return 0;
1240: b8 00 00 00 00 mov $0x0,%eax
}
1245: 48 8b 4d f8 mov -0x8(%rbp),%rcx
1249: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx
1250: 00 00
1252: 74 05 je 1259 <main+0x90>
1254: e8 47 fe ff ff callq 10a0 <__stack_chk_fail@plt>
1259: c9 leaveq
125a: c3 retq
125b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
To disable the optimization on "done", add the "volatile" keyword in its definition:
volatile register int done = 0;
This makes the program work in both cases.
(There is some overlap with Rachid K's answer as it was posted while I was writing this.)
I am guessing you are declaring done as register in hopes that it will actually be put in a register, so that its value will be saved and restored by the context switch. But the compiler is never obliged to honor this; most modern compilers ignore register declarations completely and make their own decisions about register usage. And in particular, gcc without optimizations will nearly always put local variables in memory on the stack.
As such, in your test case, the value of done is not restored by the context switch. So when getcontext returns for the second time, done has the same value as when swapcontext was called.
When the printf is present, as Rachid also points out, the done = 1 is actually stored before the swapcontext, so on the second return of getcontext, done has the value 1, the if block is skipped, and the program prints done:1 and exits.
However, when the printf is absent, the compiler notices that the value of done is never used after its assignment (since it assumes swapcontext is a normal function and doesn't know that it will actually return somewhere else), so it optimizes out the dead store (yes, even though optimizations are off). Thus we have done == 0 when getcontext returns the second time, and you get an infinite loop. This is maybe what you were expecting if you thought done would be placed in a register, but if so, you got the "right" behavior for the wrong reason.
If you enable optimizations, you'll see something else again: the compiler notices that done can't be affected by the call to getcontext (again assuming it's a normal function call) and therefore it is guaranteed to be 0 at the if. So the test need not be done at all, because it will always be true. The swapcontext is then executed unconditionally, and as for done, it's optimized completely out of existence, because it no longer has any effect on the code. You'll again see an infinite loop.
Because of this issue, you really can't make any safe assumptions about local variables that have been modified in between the getcontext and swapcontext. When getcontext returns for the second time, you might or might not see the changes. There are further issues if the compiler chose to reorder some of your code around the function call (which it knows no reason not to do, since again it thinks these are ordinary function calls that can't see your local variables).
The only way to get any certainty is to declare a variable volatile. Then you can be sure that intermediate changes will be seen, and the compiler will not assume that getcontext can't change it. The value seen at the second return of getcontext will be the same as at the call to swapcontext. If you write volatile int done = 0; you ought to see just two "I am running" messages, regardless of other code or optimization settings.
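A minimal sketch of the volatile approach described above, assuming a C library that provides <ucontext.h> (glibc does). The function name count_runs and the runs counter are illustrative additions and not part of the original program.

```c
#include <ucontext.h>

/* Returns how many times the code following getcontext() was executed. */
int count_runs(void)
{
    volatile int done = 0;   /* lives in memory and is re-read after every resume */
    volatile int runs = 0;   /* counts passes through the code below */
    ucontext_t one;
    ucontext_t two;

    getcontext(&one);        /* execution continues here again after swapcontext() */
    runs++;

    if (!done) {
        done = 1;            /* volatile store: the compiler may not drop it */
        swapcontext(&two, &one);
    }
    return runs;
}
```

With both variables volatile, the function should return 2 regardless of optimization level, which corresponds to the two "I am running" messages mentioned in the answer.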

How to stop icc from eliminating function called from inline assembly

Background
I'm making an app that needs to run several tasks concurrently. I can't use threads and such because the app should work without any OS (i.e. straight from the bootsector). Using x86 tasks looks like an overkill (both logically and performance-wise). Thus, I decided to implement a task-switching utility myself. I would save processor state, make a call to the task code and then restore the previous state. So I have to make the call from inline assembly.
Problem
Here's some example code:
#include <stdio.h>
void func() {
    printf("Hello, world!\n");
}
void (*funcptr)();
int main() {
    funcptr = func;
    asm(
        "call *%0;"
        :
        :"r"(funcptr)
    );
    return 0;
}
It compiles perfectly under icc with no options, gcc and clang and yields "Hello, world!" when run. However, if I compile it with icc main.c -ipo, it segfaults.
I disassembled the code that was generated by icc main.c and got the following:
0000000000401220 <main>:
401220: 55 push %rbp
401221: 48 89 e5 mov %rsp,%rbp
401224: 48 83 e4 80 and $0xffffffffffffff80,%rsp
401228: 48 81 ec 80 00 00 00 sub $0x80,%rsp
40122f: bf 03 00 00 00 mov $0x3,%edi
401234: 33 f6 xor %esi,%esi
401236: e8 45 00 00 00 callq 401280 <__intel_new_feature_proc_init>
40123b: 0f ae 1c 24 stmxcsr (%rsp)
40123f: 48 c7 05 f6 78 00 00 movq $0x401270,0x78f6(%rip) # 408b40 <funcptr>
401246: 70 12 40 00
40124a: b8 70 12 40 00 mov $0x401270,%eax
40124f: 81 0c 24 40 80 00 00 orl $0x8040,(%rsp)
401256: 0f ae 14 24 ldmxcsr (%rsp)
40125a: ff d0 callq *%rax
40125c: 33 c0 xor %eax,%eax
40125e: 48 89 ec mov %rbp,%rsp
401261: 5d pop %rbp
401262: c3 retq
401263: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401268: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40126f: 00
0000000000401270 <func>:
401270: bf 04 40 40 00 mov $0x404004,%edi
401275: e9 e6 fd ff ff jmpq 401060 <puts@plt>
40127a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
On the other hand, icc main.c -ipo yields:
0000000000401210 <main>:
401210: 55 push %rbp
401211: 48 89 e5 mov %rsp,%rbp
401214: 48 83 e4 80 and $0xffffffffffffff80,%rsp
401218: 48 81 ec 80 00 00 00 sub $0x80,%rsp
40121f: bf 03 00 00 00 mov $0x3,%edi
401224: 33 f6 xor %esi,%esi
401226: e8 25 00 00 00 callq 401250 <__intel_new_feature_proc_init>
40122b: 0f ae 1c 24 stmxcsr (%rsp)
40122f: 81 0c 24 40 80 00 00 orl $0x8040,(%rsp)
401236: 48 8b 05 cb 2d 00 00 mov 0x2dcb(%rip),%rax # 404008 <funcptr_2.dp.0>
40123d: 0f ae 14 24 ldmxcsr (%rsp)
401241: ff d0 callq *%rax
401243: 33 c0 xor %eax,%eax
401245: 48 89 ec mov %rbp,%rsp
401248: 5d pop %rbp
401249: c3 retq
40124a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
So, while -ipo didn't remove the funcptr variable (see address 401236), it did remove the assignment. I guess icc noticed that func is never called from C code and concluded that it can be safely removed, so funcptr is allowed to contain garbage. However, it didn't notice that I'm calling func indirectly via assembly.
What I tried
Replacing "r"(funcptr) with "r"(func) works but I can't hardcode a specific function (see background).
Calling funcptr and/or func before and/or after the inline assembly block doesn't help because icc just inlines printf("Hello, world!\n");.
I can't get rid of inline assembly because I have to do low-level register, flags and stack manipulation before and after call.
Making funcptr volatile yields the following warning but still segfaults:
a value of type "void (*)()" cannot be assigned to an entity of type "volatile void (*)()"
Adding volatile to almost every other word doesn't help either.
Moving func and/or funcptr to other source files and then linking them together doesn't help.
Moving inline assembly to a separate function doesn't work.
Am I doing something wrong or is it an icc bug? If the former, how do I fix the code? If the latter, is there any workaround and should I report the bug?
$ icc --version
icc (ICC) 19.1.0.166 20191121
Copyright (C) 1985-2019 Intel Corporation. All rights reserved.
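About the warning quoted in the list of attempts: in volatile void (*funcptr)() the qualifier applies to the function's return type, not to the pointer, which is why the assignment is rejected. A sketch of a declaration where the pointer object itself is volatile follows. Whether this changes what icc -ipo generates is not established in this thread, and the names calls and call_via_volatile_pointer are illustrative.

```c
static int calls;                 /* counts how often func() has run */

static void func(void)
{
    calls++;
}

/* volatile comes after the '*', so it qualifies the pointer variable:
 * every read and write of funcptr has to be performed as written. */
void (* volatile funcptr)(void);

int call_via_volatile_pointer(void)
{
    funcptr = func;   /* a plain function address converts without a warning */
    funcptr();        /* the pointer value is loaded from memory for this call */
    return calls;
}
```

The same declaration can be passed to an inline assembly block as an "r" input operand; the sketch uses a plain C call only to stay portable.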

GCC optimizer generating error in nostdlib code

I have the following code:
void cp(void *a, const void *b, int n) {
    for (int i = 0; i < n; ++i) {
        ((char *) a)[i] = ((const char *) b)[i];
    }
}
void _start(void) {
    char buf[20];
    const char m[] = "123456789012345";
    cp(buf, m, 15);
    register int rax __asm__ ("rax") = 60; // exit
    register int rdi __asm__ ("rdi") = 0; // status
    __asm__ volatile (
        "syscall" :: "r" (rax), "r" (rdi) : "cc", "rcx", "r11"
    );
    __builtin_unreachable();
}
If I compile it with gcc -nostdlib -O1 "./a.c" -o "./a", I get a functioning program, but if I compile it with -O2, I get a program that generates a segmentation fault.
This is the generated code with -O1:
0000000000001000 <cp>:
1000: b8 00 00 00 00 mov $0x0,%eax
1005: 0f b6 14 06 movzbl (%rsi,%rax,1),%edx
1009: 88 14 07 mov %dl,(%rdi,%rax,1)
100c: 48 83 c0 01 add $0x1,%rax
1010: 48 83 f8 0f cmp $0xf,%rax
1014: 75 ef jne 1005 <cp+0x5>
1016: c3 retq
0000000000001017 <_start>:
1017: 48 83 ec 30 sub $0x30,%rsp
101b: 48 b8 31 32 33 34 35 movabs $0x3837363534333231,%rax
1022: 36 37 38
1025: 48 ba 39 30 31 32 33 movabs $0x35343332313039,%rdx
102c: 34 35 00
102f: 48 89 04 24 mov %rax,(%rsp)
1033: 48 89 54 24 08 mov %rdx,0x8(%rsp)
1038: 48 89 e6 mov %rsp,%rsi
103b: 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
1040: ba 0f 00 00 00 mov $0xf,%edx
1045: e8 b6 ff ff ff callq 1000 <cp>
104a: b8 3c 00 00 00 mov $0x3c,%eax
104f: bf 00 00 00 00 mov $0x0,%edi
1054: 0f 05 syscall
And this is the generated code with -O2:
0000000000001000 <cp>:
1000: 31 c0 xor %eax,%eax
1002: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
1008: 0f b6 14 06 movzbl (%rsi,%rax,1),%edx
100c: 88 14 07 mov %dl,(%rdi,%rax,1)
100f: 48 83 c0 01 add $0x1,%rax
1013: 48 83 f8 0f cmp $0xf,%rax
1017: 75 ef jne 1008 <cp+0x8>
1019: c3 retq
101a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000001020 <_start>:
1020: 48 8d 44 24 d8 lea -0x28(%rsp),%rax
1025: 48 8d 54 24 c9 lea -0x37(%rsp),%rdx
102a: b9 31 00 00 00 mov $0x31,%ecx
102f: 66 0f 6f 05 c9 0f 00 movdqa 0xfc9(%rip),%xmm0 # 2000 <_start+0xfe0>
1036: 00
1037: 48 8d 70 0f lea 0xf(%rax),%rsi
103b: 0f 29 44 24 c8 movaps %xmm0,-0x38(%rsp)
1040: eb 0d jmp 104f <_start+0x2f>
1042: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
1048: 0f b6 0a movzbl (%rdx),%ecx
104b: 48 83 c2 01 add $0x1,%rdx
104f: 88 08 mov %cl,(%rax)
1051: 48 83 c0 01 add $0x1,%rax
1055: 48 39 f0 cmp %rsi,%rax
1058: 75 ee jne 1048 <_start+0x28>
105a: b8 3c 00 00 00 mov $0x3c,%eax
105f: 31 ff xor %edi,%edi
1061: 0f 05 syscall
The crash happens at 103b, instruction movaps %xmm0,-0x38(%rsp).
I noticed that if m contains less than 15 characters, then the generated code is different and the crash does not happen.
What am I doing wrong?
_start is not a function. It's not called by anything, and on entry the stack is 16-byte aligned, not (as the ABI requires) 8 bytes away from 16-byte alignment.
(The ABI requires 16-byte alignment before a call, and call pushes an 8-byte return address. So on function entry RSP-8 and RSP+8 are 16-byte aligned.)
At -O2 GCC uses alignment-required 16-byte instructions to implement the copy done by cp(), copying the "123456789012345" from static storage to the stack.
At -O1, GCC just uses two mov r64, imm64 instructions to get bytes into integer regs for 8-byte stores. These don't require alignment.
Workarounds
Just write a main in C like a normal person if you want everything to work.
Or if you're trying to microbenchmark something light-weight in asm, you can use gcc -nostdlib -O3 -mincoming-stack-boundary=3 (docs) to tell GCC that functions can't assume they're called with more than 8-byte alignment. Unlike -mpreferred-stack-boundary=3, this will still align by 16 before making further calls. So if you have other non-leaf functions, you might want to just use an attribute on your hacky C _start() instead of affecting the whole file.
A worse, more hacky way would be to try putting
asm("push %rax"); at the very top of _start to modify RSP by 8, where GCC hopefully runs it before doing anything else with the stack. GNU C Basic asm statements are implicitly volatile so you don't need asm volatile, although that wouldn't hurt.
You're 100% on your own and responsible for correctly tricking the compiler by using inline asm that works for whatever optimization level you're using.
Another, safer way would be to write your own light-weight _start that calls main:
// at global scope:
asm(
    ".globl _start \n"
    "_start: \n"
    " mov (%rsp), %rdi \n" // argc
    " lea 8(%rsp), %rsi \n" // argv
    " lea 8(%rsi, %rdi, 8), %rdx \n" // envp
    " call main \n"
    // NOT DONE: stdio cleanup or other atexit stuff
    // DO NOT USE WITH GLIBC; use libc's CRT code if you use libc
    " mov %eax, %edi \n"
    " mov $231, %eax \n"
    " syscall" // exit_group( main() )
);
int main(int argc, char**argv, char**envp) {
    // ... your code here
    return 0;
}
If you didn't want main to return, you could just pop %rdi; mov %rsp, %rsi ; jmp main to give it argc and argv without a return address.
Then main can exit via inline asm, or by calling exit() or _exit() if you link libc. (But if you link libc, you should usually use its _start.)
See also: How Get arguments value using inline assembly in C without Glibc? for other hand-rolled _start versions; this is pretty much like @zwol's there.
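The per-function attribute mentioned above is not named in the answer. One candidate that GCC documents for x86 targets is force_align_arg_pointer, which makes the function's prologue realign the stack instead of trusting the caller. The sketch below applies it to an ordinary function (entry_like, a hypothetical name) so that it can be compiled and run in a normal hosted program; in a -nostdlib build the same attribute would go on _start. The REALIGN_STACK macro is also just an illustration.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK            /* the attribute is x86-specific */
#endif

/* Returns 1 if a local that requests 16-byte alignment really is 16-byte
 * aligned inside this function, and 0 otherwise. */
REALIGN_STACK
int entry_like(void)
{
    _Alignas(16) volatile char buf[16];

    buf[0] = 1;                              /* touch the buffer so it is kept */
    return ((uintptr_t)buf % 16) == 0;
}
```

Called normally, this returns 1 either way; the attribute only matters when the incoming stack pointer does not follow the ABI, as on process entry.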

Understanding array declaration in C

I'm trying to understand how the C Standard explains that the declaration can cause an error. Consider the following pretty simple code:
int main()
{
    char test[1024 * 1024 * 1024];
    test[0] = 0;
    return 0;
}
Demo
This segfaults. But the following code does not:
int main()
{
    char test[1024 * 1024 * 1024];
    return 0;
}
Demo
But when I compiled it on my machine, the latter one segfaulted too. The main function looks like this:
00000000000008c6 <main>:
8c6: 55 push %rbp
8c7: 48 89 e5 mov %rsp,%rbp
8ca: 48 81 ec 20 00 00 40 sub $0x40000020,%rsp
8d1: 89 bd ec ff ff bf mov %edi,-0x40000014(%rbp) // <---HERE
8d7: 48 89 b5 e0 ff ff bf mov %rsi,-0x40000020(%rbp)
8de: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
8e5: 00 00
8e7: 48 89 45 f8 mov %rax,-0x8(%rbp)
8eb: 31 c0 xor %eax,%eax
8ed: b8 00 00 00 00 mov $0x0,%eax
8f2: 48 8b 55 f8 mov -0x8(%rbp),%rdx
8f6: 64 48 33 14 25 28 00 xor %fs:0x28,%rdx
8fd: 00 00
8ff: 74 05 je 906 <main+0x40>
901: e8 1a fe ff ff callq 720 <__stack_chk_fail@plt>
906: c9 leaveq
907: c3 retq
908: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
90f: 00
As far as I understood the segfault occurred when trying to mov %edi,-0x40000014(%rbp).
I tried to find the explanation in N1570, Section 6.7.9 Initialization, but it does not seem to be the relevant one.
So how does the Standard explain this behavior?
The result is implementation-dependent.
I can think of several reasons why the behaviour might differ:
The compiler sees that the variable isn't used and has no possible side effect, and optimizes it away (even without optimization levels).
The stack is resized on request. Since there are no writes to this variable yet, why resize the stack now?
Compilers don't have to use the stack for automatic storage. A compiler can allocate the memory using malloc and free it on exit. Using the heap would allow allocating 1 GB without issues.
The stack size is set to 1 GB :)
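On POSIX systems the stack-size point can be checked directly. A small sketch, assuming getrlimit from <sys/resource.h> is available; fits_under_stack_limit is a hypothetical helper, and it ignores the stack space that is already in use.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Returns 1 if an object of `bytes` bytes is smaller than the current soft
 * stack limit, 0 if it is not, and -1 if the limit cannot be queried. */
int fits_under_stack_limit(size_t bytes)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return -1;
    if (lim.rlim_cur == RLIM_INFINITY)
        return 1;                /* no soft limit is configured */
    return bytes < lim.rlim_cur;
}
```

With a soft limit of a few megabytes, fits_under_stack_limit((size_t)1 << 30) would return 0, which is consistent with a 1 GiB automatic array faulting as soon as the far end of the frame is touched.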

Why does this code prevent gcc & llvm from tail-call optimization?

I have tried the following code on gcc 4.4.5 on Linux and gcc-llvm on Mac OSX(Xcode 4.2.1) and this. The below are the source and the generated disassembly of the relevant functions. (Added: compiled with gcc -O2 main.c)
#include <stdio.h>
__attribute__((noinline))
static void g(long num)
{
    long m, n;
    printf("%p %ld\n", &m, n);
    return g(num-1);
}
__attribute__((noinline))
static void h(long num)
{
    long m, n;
    printf("%ld %ld\n", m, n);
    return h(num-1);
}
__attribute__((noinline))
static void f(long * num)
{
    scanf("%ld", num);
    g(*num);
    h(*num);
    return f(num);
}
int main(void)
{
    printf("int:%lu long:%lu unsigned:%lu\n", sizeof(int), sizeof(long), sizeof(unsigned));
    long num;
    f(&num);
    return 0;
}
08048430 <g>:
8048430: 55 push %ebp
8048431: 89 e5 mov %esp,%ebp
8048433: 53 push %ebx
8048434: 89 c3 mov %eax,%ebx
8048436: 83 ec 24 sub $0x24,%esp
8048439: 8d 45 f4 lea -0xc(%ebp),%eax
804843c: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
8048443: 00
8048444: 89 44 24 04 mov %eax,0x4(%esp)
8048448: c7 04 24 d0 85 04 08 movl $0x80485d0,(%esp)
804844f: e8 f0 fe ff ff call 8048344 <printf@plt>
8048454: 8d 43 ff lea -0x1(%ebx),%eax
8048457: e8 d4 ff ff ff call 8048430 <g>
804845c: 83 c4 24 add $0x24,%esp
804845f: 5b pop %ebx
8048460: 5d pop %ebp
8048461: c3 ret
8048462: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
8048469: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
08048470 <h>:
8048470: 55 push %ebp
8048471: 89 e5 mov %esp,%ebp
8048473: 83 ec 18 sub $0x18,%esp
8048476: 66 90 xchg %ax,%ax
8048478: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
804847f: 00
8048480: c7 44 24 04 00 00 00 movl $0x0,0x4(%esp)
8048487: 00
8048488: c7 04 24 d8 85 04 08 movl $0x80485d8,(%esp)
804848f: e8 b0 fe ff ff call 8048344 <printf@plt>
8048494: eb e2 jmp 8048478 <h+0x8>
8048496: 8d 76 00 lea 0x0(%esi),%esi
8048499: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi
080484a0 <f>:
80484a0: 55 push %ebp
80484a1: 89 e5 mov %esp,%ebp
80484a3: 53 push %ebx
80484a4: 89 c3 mov %eax,%ebx
80484a6: 83 ec 14 sub $0x14,%esp
80484a9: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
80484b0: 89 5c 24 04 mov %ebx,0x4(%esp)
80484b4: c7 04 24 e1 85 04 08 movl $0x80485e1,(%esp)
80484bb: e8 94 fe ff ff call 8048354 <__isoc99_scanf@plt>
80484c0: 8b 03 mov (%ebx),%eax
80484c2: e8 69 ff ff ff call 8048430 <g>
80484c7: 8b 03 mov (%ebx),%eax
80484c9: e8 a2 ff ff ff call 8048470 <h>
80484ce: eb e0 jmp 80484b0 <f+0x10>
We can see that g() and h() are mostly identical except the & (address of) operator beside the argument m of printf()(and the irrelevant %ld and %p).
However, h() is tail-call optimized and g() is not. Why?
In g(), you're taking the address of a local variable and passing it to a function. A "sufficiently smart compiler" should realize that printf does not store that pointer. Instead, gcc and llvm assume that printf might store the pointer somewhere, so the call frame containing m might need to be "live" further down in the recursion. Therefore, no TCO.
It's the & that does it. It tells the compiler that m should be stored on the stack. Even though it is only passed to printf, the compiler has to assume that it might be accessed by somebody else, and thus it must be cleaned from the stack after the call to g.
In this particular case, as printf is known by the compiler (and it knows that it does not save pointers), it could probably be taught to perform this optimization.
For more info on this, look up 'escape analysis'.
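One way to keep an escaping address out of the recursive frame is to take it inside a separate, non-inlined helper, so that the frame making the tail call has no address-taken locals. This is a sketch under that assumption; whether a particular gcc or clang version then emits a jump instead of a call is not guaranteed. The names last_seen, record_local_address and count_down are hypothetical.

```c
#include <stdint.h>

/* Sink that lets the address of a local escape, much as passing &m to
 * printf does in g(). */
static volatile uintptr_t last_seen;

__attribute__((noinline))
static void record_local_address(void)
{
    long m = 0;

    /* &m escapes here, but only from this short-lived helper frame. */
    last_seen = (uintptr_t)&m;
}

/* Counts n down to zero. No local of this function has its address taken,
 * so nothing in this frame has to stay alive across the recursive call. */
long count_down(long n, long acc)
{
    if (n == 0)
        return acc;
    record_local_address();
    return count_down(n - 1, acc + 1);   /* call in tail position */
}
```

For example, count_down(1000, 0) returns 1000 whether or not the tail call is optimized; comparing the disassembly at -O2 with that of g() shows what the compiler decided.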
