gcc-10.0.1 Specific Segfault

gcc-10.0.1 Specific Segfault - c

I have an R package with C compiled code that's been relatively stable for quite a while and is frequently tested against a broad variety of platforms and compilers (windows/osx/debian/fedora gcc/clang).
More recently a new platform was added to test the package again:
Logs from checks with gcc trunk aka 10.0.1 compiled from source
on Fedora 30. (For some archived packages, 10.0.0.)
x86_64 Fedora 30 Linux
FFLAGS="-g -O2 -mtune=native -Wall -fallow-argument-mismatch"
CFLAGS="-g -O2 -Wall -pedantic -mtune=native -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong -fstack-clash-protection -fcf-protection"
CXXFLAGS="-g -O2 -Wall -pedantic -mtune=native -Wno-ignored-attributes -Wno-deprecated-declarations -Wno-parentheses -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong -fstack-clash-protection -fcf-protection"
At which point the compiled code promptly started segfaulting along these lines:
*** caught segfault ***
address 0x1d00000001, cause 'memory not mapped'
I've been able to reproduce the segfault consistently by using the rocker/r-base docker container with gcc-10.0.1 with optimization level -O2. Running a lower optimization gets rid of the problem. Running any other set-up, including under valgrind (both -O0 and -O2), UBSAN (gcc/clang), shows no problems at all. I'm also reasonably sure this ran under gcc-10.0.0, but don't have the data.
I ran the gcc-10.0.1 -O2 version with gdb and noticed something that seems odd to me:
While stepping through the highlighted section it appears the initialization of the second elements of the arrays is skipped (R_alloc is a wrapper around malloc that self garbage collects when returning control to R; the segfault happens before return to R). Later, the program crashes when the un-initialized element (in the gcc.10.0.1 -O2 version) is accessed.
I fixed this by explicitly initializing the element in question everywhere in the code that eventually led to the usage of the element, but it really should have been initialized to an empty string, or at least that's what I would have assumed.
Am I missing something obvious or doing something stupid? Both are reasonably likely as C is my second language by far. It's just strange that this just cropped up now, and I can't figure out what the compiler is trying to do.
UPDATE: Instructions to reproduce this, although this will only reproduce so long as debian:testing docker container has gcc-10 at gcc-10.0.1. Also, don't just run these commands if you don't trust me.
Sorry this is not a minimal reproducible example.
docker pull rocker/r-base
docker run --rm -ti --security-opt seccomp=unconfined \
rocker/r-base /bin/bash
apt-get update
apt-get install gcc-10 gdb
gcc-10 --version # confirm 10.0.1
# gcc-10 (Debian 10-20200222-1) 10.0.1 20200222 (experimental)
# [master revision 01af7e0a0c2:487fe13f218:e99b18cf7101f205bfdd9f0f29ed51caaec52779]
mkdir ~/.R
touch ~/.R/Makevars
echo "CC = gcc-10
CFLAGS = -g -O2 -Wall -pedantic -mtune=native -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong -fstack-clash-protection -fcf-protection
" >> ~/.R/Makevars
R -d gdb --vanilla
Then in the R console, after typing run to get gdb to run the program:
f.dl <- tempfile()
f.uz <- tempfile()
github.url <- 'https://github.com/brodieG/vetr/archive/v0.2.8.zip'
download.file(github.url, f.dl)
unzip(f.dl, exdir=f.uz)
install.packages(
file.path(f.uz, 'vetr-0.2.8'), repos=NULL,
INSTALL_opts="--install-tests", type='source'
)
# minimal set of commands to segfault
library(vetr)
alike(pairlist(a=1, b="character"), pairlist(a=1, b=letters))
alike(pairlist(1, "character"), pairlist(1, letters))
alike(NULL, 1:3) # not a wild card at top level
alike(list(NULL), list(1:3)) # but yes when nested
alike(list(NULL, NULL), list(list(list(1, 2, 3)), 1:25))
alike(list(NULL), list(1, 2))
alike(list(), list(1, 2))
alike(matrix(integer(), ncol=7), matrix(1:21, nrow=3))
alike(matrix(character(), nrow=3), matrix(1:21, nrow=3))
alike(
matrix(integer(), ncol=3, dimnames=list(NULL, c("R", "G", "B"))),
matrix(1:21, ncol=3, dimnames=list(NULL, c("R", "G", "B")))
)
# Adding tests from docs
mx.tpl <- matrix(
integer(), ncol=3, dimnames=list(row.id=NULL, c("R", "G", "B"))
)
mx.cur <- matrix(
sample(0:255, 12), ncol=3, dimnames=list(row.id=1:4, rgb=c("R", "G", "B"))
)
mx.cur2 <-
matrix(sample(0:255, 12), ncol=3, dimnames=list(1:4, c("R", "G", "B")))
alike(mx.tpl, mx.cur2)
Inspecting in gdb pretty quickly shows (if I understand correctly) that
CSR_strmlen_x is trying to access the string that was not initialized.
UPDATE 2: this is a highly recursive function, and on top of that the string initialization bit gets called many, many times. This is mostly b/c I was being lazy, we only need the strings initialized for the one time we actually encounter something we want to report in the recursion, but it was easier to initialize every time it is possible to encounter something. I mention this because what you'll see next shows multiple initializations, but only one of them (presumably the one with address <0x1400000001>) is being used.
I can't guarantee that the stuff I'm showing here is directly related to the element that caused the segfault (though it is the same illegal address acccess), but as #nate-eldredge asked it does show that the array element is not initialized either just before return or just after return in the calling function. Note the calling function is initializing 8 of these, and I show them all, with all them filled with either garbage or inaccessible memory.
UPDATE 3, disassembly of function in question:
Breakpoint 1, ALIKEC_res_strings_init () at alike.c:75
75 return res;
(gdb) p res.current[0]
$1 = 0x7ffff46a0aa5 "%s%s%s%s"
(gdb) p res.current[1]
$2 = 0x1400000001 <error: Cannot access memory at address 0x1400000001>
(gdb) disas /m ALIKEC_res_strings_init
Dump of assembler code for function ALIKEC_res_strings_init:
53 struct ALIKEC_res_strings ALIKEC_res_strings_init() {
0x00007ffff4687fc0 <+0>: endbr64
54 struct ALIKEC_res_strings res;
55
56 res.target = (const char **) R_alloc(5, sizeof(const char *));
0x00007ffff4687fc4 <+4>: push %r12
0x00007ffff4687fc6 <+6>: mov $0x8,%esi
0x00007ffff4687fcb <+11>: mov %rdi,%r12
0x00007ffff4687fce <+14>: push %rbx
0x00007ffff4687fcf <+15>: mov $0x5,%edi
0x00007ffff4687fd4 <+20>: sub $0x8,%rsp
0x00007ffff4687fd8 <+24>: callq 0x7ffff4687180 <R_alloc#plt>
0x00007ffff4687fdd <+29>: mov $0x8,%esi
0x00007ffff4687fe2 <+34>: mov $0x5,%edi
0x00007ffff4687fe7 <+39>: mov %rax,%rbx
57 res.current = (const char **) R_alloc(5, sizeof(const char *));
0x00007ffff4687fea <+42>: callq 0x7ffff4687180 <R_alloc#plt>
58
59 res.target[0] = "%s%s%s%s";
0x00007ffff4687fef <+47>: lea 0x1764a(%rip),%rdx # 0x7ffff469f640
0x00007ffff4687ff6 <+54>: lea 0x18aa8(%rip),%rcx # 0x7ffff46a0aa5
0x00007ffff4687ffd <+61>: mov %rcx,(%rbx)
60 res.target[1] = "";
61 res.target[2] = "";
0x00007ffff4688000 <+64>: mov %rdx,0x10(%rbx)
62 res.target[3] = "";
0x00007ffff4688004 <+68>: mov %rdx,0x18(%rbx)
63 res.target[4] = "";
0x00007ffff4688008 <+72>: mov %rdx,0x20(%rbx)
64
65 res.tar_pre = "be";
66
67 res.current[0] = "%s%s%s%s";
0x00007ffff468800c <+76>: mov %rax,0x8(%r12)
0x00007ffff4688011 <+81>: mov %rcx,(%rax)
68 res.current[1] = "";
69 res.current[2] = "";
0x00007ffff4688014 <+84>: mov %rdx,0x10(%rax)
70 res.current[3] = "";
0x00007ffff4688018 <+88>: mov %rdx,0x18(%rax)
71 res.current[4] = "";
0x00007ffff468801c <+92>: mov %rdx,0x20(%rax)
72
73 res.cur_pre = "is";
74
75 return res;
=> 0x00007ffff4688020 <+96>: lea 0x14fe0(%rip),%rax # 0x7ffff469d007
0x00007ffff4688027 <+103>: mov %rax,0x10(%r12)
0x00007ffff468802c <+108>: lea 0x14fcd(%rip),%rax # 0x7ffff469d000
0x00007ffff4688033 <+115>: mov %rbx,(%r12)
0x00007ffff4688037 <+119>: mov %rax,0x18(%r12)
0x00007ffff468803c <+124>: add $0x8,%rsp
0x00007ffff4688040 <+128>: pop %rbx
0x00007ffff4688041 <+129>: mov %r12,%rax
0x00007ffff4688044 <+132>: pop %r12
0x00007ffff4688046 <+134>: retq
0x00007ffff4688047: nopw 0x0(%rax,%rax,1)
End of assembler dump.
UPDATE 4:
So, trying to parse through the standard here are the parts of it that seem relevant (C11 draft):
6.3.2.3 Par7 Conversions > Other Operands > Pointers
A pointer to an object type may be converted to a pointer to a
different object type. If the resulting pointer is not correctly
aligned 68) for the referenced type, the behavior is undefined.
Otherwise, when converted back again, the result shall compare
equal to the original pointer. When a pointer to an object is
converted to a pointer to a character type,the result points to the
lowest addressed byte of the object. Successive increments of
the result, up to the size of the object, yield pointers to the
remaining bytes of the object.
6.5 Par6 Expressions
The effective type of an object for an access to its stored value is the
declared type of the object, if any. 87) If a value is stored into
an object having no declared type through an lvalue having a
type that is not a character type, then the type of the lvalue becomes
the effective type of the object for that access and for subsequent
accesses that do not modify the stored value. If a value is
copied into an object having no declared type
using memcpy or memmove, or is copied as an array of character type, then
the effective type of the modified object for that access and for
subsequent accesses that do not modify the value is the effective type
of the object from which the value is copied, if it has one. For all
other accesses to an object having no declared type, the effective
type of the object is simply the type of the lvalue used for the
access.
87) Allocated objects have no declared type.
IIUC R_alloc returns an offset into a malloced block that is guaranteed to be double aligned, and the size of the block after the offset is of the requested size (there is also allocation before the offset for R specific data). R_alloc casts that pointer to (char *) on return.
Section 6.2.5 Par 29
A pointer to void shall have the same representation and
alignment requirements as a pointer to a character
type. 48) Similarly, pointers to qualified or unqualified versions
of compatible types shall have the same representation and
alignment requirements. All pointers to structure types shall have
the same representation and alignment requirements as each other.
All pointers to union types shall have the same
representation and alignment requirements as each other.
Pointers to other types need not have the same representation
or alignment requirements.
48) The same representation and alignment requirements are meant to imply interchangeability asarguments to functions, return values from functions, and members of unions.
So the question is "are we allowed to recast the (char *) to (const char **) and write to it as (const char **)". My reading of the above is that so long as pointers on the systems the code run in have alignment compatible with double alignment, then its okay.
Are we violating "strict aliasing"? i.e.:
6.5 Par 7
An object shall have its stored value accessed only by an lvalue
expression that has one of the following types: 88)
— a type compatible with the effective type of the object
...
88) The intent of this list is to specify those circumstances in which an object may or may not be aliased.
So, what should the compiler think the effective type of the object pointed to by res.target (or res.current) is? Presumably the declared type (const char **), or is this actually ambiguous? It feels to me that it isn't in this case only because there is no other 'lvalue' in scope that accesses the same object.
I'll admit I'm struggling mightily to extract sense from these sections of the standard.

Summary: This appears to be a bug in gcc, related to string optimization. A self-contained testcase is below. There was initially some doubt as to whether the code is correct, but I think it is.
I have reported the bug as PR 93982. A proposed fix was committed but it does not fix it in all cases, leading to the followup PR 94015 (godbolt link).
You should be able to work around the bug by compiling with the flag -fno-optimize-strlen.
I was able to reduce your test case to the following minimal example (also on godbolt):
struct a {
const char ** target;
};
char* R_alloc(void);
struct a foo(void) {
struct a res;
res.target = (const char **) R_alloc();
res.target[0] = "12345678";
res.target[1] = "";
res.target[2] = "";
res.target[3] = "";
res.target[4] = "";
return res;
}
With gcc trunk (gcc version 10.0.1 20200225 (experimental)) and -O2 (all other options turned out to be unnecessary), the generated assembly on amd64 is as follows:
.LC0:
.string "12345678"
.LC1:
.string ""
foo:
subq $8, %rsp
call R_alloc
movq $.LC0, (%rax)
movq $.LC1, 16(%rax)
movq $.LC1, 24(%rax)
movq $.LC1, 32(%rax)
addq $8, %rsp
ret
So you are quite right that the compiler is failing to initialize res.target[1] (note the conspicuous absence of movq $.LC1, 8(%rax)).
It is interesting to play with the code and see what affects the "bug". Perhaps significantly, changing the return type of R_alloc to void * makes it go away, and gives you "correct" assembly output. Maybe less significantly but more amusingly, changing the string "12345678" to be either longer or shorter also makes it go away.
Previous discussion, now resolved - the code is apparently legal.
The question I have is whether your code is actually legal. The fact that you take the char * returned by R_alloc() and cast it to const char **, and then store a const char * seems like it might violate the strict aliasing rule, as char and const char * are not compatible types. There is an exception that allows you to access any object as char (to implement things like memcpy), but this is the other way around, and as best I understand it, that's not allowed. It makes your code produce undefined behavior and so the compiler can legally do whatever the heck it wants.
If this is so, the correct fix would be for R to change their code so that R_alloc() returns void * instead of char *. Then there would be no aliasing problem. Unfortunately, that code is outside your control, and it's not clear to me how you can use this function at all without violating strict aliasing. A workaround might be to interpose a temporary variable, e.g. void *tmp = R_alloc(); res.target = tmp; which solves the problem in the test case, but I'm still not sure if it's legal.
However, I am not sure of this "strict aliasing" hypothesis, because compiling with -fno-strict-aliasing, which AFAIK is supposed to make gcc allow such constructs, does not make the problem go away!
Update. Trying some different options, I found that either -fno-optimize-strlen or -fno-tree-forwprop will result in "correct" code being generated. Also, using -O1 -foptimize-strlen yields the incorrect code (but -O1 -ftree-forwprop does not).
After a little git bisect exercise, the error seems to have been introduced in commit 34fcf41e30ff56155e996f5e04.
Update 2. I tried digging into the gcc source a little bit, just to see what I could learn. (I don't claim to be any sort of compiler expert!)
It looks like the code in tree-ssa-strlen.c is meant to keep track of strings appearing in the program. As near as I can tell, the bug is that in looking at the statement res.target[0] = "12345678"; the compiler conflates the address of the string literal "12345678" with the string itself. (That seems to be related to this suspicious code which was added in the aforementioned commit, where if it tries to count the bytes of a "string" that is actually an address, it instead looks at what that address points to.)
So it thinks that the statement res.target[0] = "12345678", instead of storing the address of "12345678" at the address res.target, is storing the string itself at that address, as if the statement were strcpy(res.target, "12345678"). Note for what's ahead that this would result in the trailing nul being stored at address res.target+8 (at this stage in the compiler, all offsets are in bytes).
Now when the compiler looks at res.target[1] = "", it likewise treats this as if it were strcpy(res.target+8, ""), the 8 coming from the size of a char *. That is, as if it were simply storing a nul byte at address res.target+8. However, the compiler "knows" that the previous statement already stored a nul byte at that very address! As such, this statement is "redundant" and can be discarded (here).
This explains why the string has to be exactly 8 characters long to trigger the bug. (Though other multiples of 8 can also trigger the bug in other situations.)

Related

Does gcc cache the address of a global array at a static index by default?

Let's say I have the global array
char global_array[3] = {33, 17, 6}; and I access global_array[1]. When gcc interprets global_array[1], will it add 1 to the address of global_array at runtime, or change global_array[1] to a static address, assuming -O0? If not, is there a compiler flag to for gcc to make the static index of an array a static pointer?

When gcc interprets global_array[1], will it add 1 to the address of global_array at runtime, or change global_array[1] to a static address, assuming -O0?
There's no particular guarantees of what gcc does and does not depending on level of optimization. Take your example under gcc 12.2 for x86_64:
int func (void)
{
char global_array[3] = {33, 17, 6};
return global_array[1];
}
This results in peculiar assembly code under -O0:
func:
push rbp
mov rbp, rsp
mov WORD PTR [rbp-3], 4385
mov BYTE PTR [rbp-1], 6
movzx eax, BYTE PTR [rbp-2]
movsx eax, al
pop rbp
ret
It is storing the array of the stack but it loads the magic number 4385 as a word, then loads a 6 at the end. Now wth is 4385? It is 0x1121 hex. Where the bytes have values 0x11 = 17dec and 0x21 = 33, and since it is a x86 little endian, the byte order for the WORD instruction is reversed vs the order of the array.
This word write is already some manner of mini-optimization even though I used -O0. So again - no guarantees. Regarding your question, the access part rbp-2 is indeed similar to adding 1 the to the address in runtime. In this case it is subtracting an offset from the stack frame pointer, but that's essentially the same thing (and likely as fast/slow as adding 1 to an address).
The normal thing when optimizations are enabled would otherwise be to replace the whole thing with the absolute value.
If not, is there a compiler flag to for gcc to make the static index of an array a static pointer?
Messing around with compiler options is probably not how. If you wish to have a pointer to a fixed address, then this is the usual steps:
Create a custom segment at a fixed address in the linker script for the specific target.
Declare your array at file scope and use gcc variable attributes to allocate it inside your custom segment.
Now &global_array[1] is always the same address.

How Get arguments value using inline assembly in C without Glibc?

How Get arguments value using inline assembly in C without Glibc?
i require this code for Linux archecture x86_64 and i386.
if you know about MAC OS X or Windows , also submit and please guide.
void exit(int code)
{
//This function not important!
//...
}
void _start()
{
//How Get arguments value using inline assembly
//in C without Glibc?
//argc
//argv
exit(0);
}
New Update
https://gist.github.com/apsun/deccca33244471c1849d29cc6bb5c78e
and
#define ReadRdi(To) asm("movq %%rdi,%0" : "=r"(To));
#define ReadRsi(To) asm("movq %%rsi,%0" : "=r"(To));
long argcL;
long argvL;
ReadRdi(argcL);
ReadRsi(argvL);
int argc = (int) argcL;
//char **argv = (char **) argvL;
exit(argc);
But it still returns 0.
So this code is wrong!
please help.

As specified in the comment, argc and argv are provided on the stack, so you cannot use a regular C function to get them, even with inline assembly, as the compiler will touch the stack pointer to allocate the local variables, setup the stack frame & co.; hence, _start must be written in assembly, as it's done in glibc (x86; x86_64). A small stub can be written to just grab the stuff and forward it to your "real" C entrypoint according to the regular calling convention.
Here a minimal example of a program (both for x86 and x86_64) that reads argc and argv, prints all the values in argv on stdout (separated by newline) and exits using argc as status code; it can be compiled with the usual gcc -nostdlib (and -static to make sure ld.so isn't involved; not that it does any harm here).
#ifdef __x86_64__
asm(
".global _start\n"
"_start:\n"
" xorl %ebp,%ebp\n" // mark outermost stack frame
" movq 0(%rsp),%rdi\n" // get argc
" lea 8(%rsp),%rsi\n" // the arguments are pushed just below, so argv = %rbp + 8
" call bare_main\n" // call our bare_main
" movq %rax,%rdi\n" // take the main return code and use it as first argument for...
" movl $60,%eax\n" // ... the exit syscall
" syscall\n"
" int3\n"); // just in case
asm(
"bare_write:\n" // write syscall wrapper; the calling convention is pretty much ok as is
" movq $1,%rax\n" // 1 = write syscall on x86_64
" syscall\n"
" ret\n");
#endif
#ifdef __i386__
asm(
".global _start\n"
"_start:\n"
" xorl %ebp,%ebp\n" // mark outermost stack frame
" movl 0(%esp),%edi\n" // argc is on the top of the stack
" lea 4(%esp),%esi\n" // as above, but with 4-byte pointers
" sub $8,%esp\n" // the start starts 16-byte aligned, we have to push 2*4 bytes; "waste" 8 bytes
" pushl %esi\n" // to keep it aligned after pushing our arguments
" pushl %edi\n"
" call bare_main\n" // call our bare_main
" add $8,%esp\n" // fix the stack after call (actually useless here)
" movl %eax,%ebx\n" // take the main return code and use it as first argument for...
" movl $1,%eax\n" // ... the exit syscall
" int $0x80\n"
" int3\n"); // just in case
asm(
"bare_write:\n" // write syscall wrapper; convert the user-mode calling convention to the syscall convention
" pushl %ebx\n" // ebx is callee-preserved
" movl 8(%esp),%ebx\n" // just move stuff from the stack to the correct registers
" movl 12(%esp),%ecx\n"
" movl 16(%esp),%edx\n"
" mov $4,%eax\n" // 4 = write syscall on i386
" int $0x80\n"
" popl %ebx\n" // restore ebx
" ret\n"); // notice: the return value is already ok in %eax
#endif
int bare_write(int fd, const void *buf, unsigned count);
unsigned my_strlen(const char *ch) {
const char *ptr;
for(ptr = ch; *ptr; ++ptr);
return ptr-ch;
}
int bare_main(int argc, char *argv[]) {
for(int i = 0; i < argc; ++i) {
int len = my_strlen(argv[i]);
bare_write(1, argv[i], len);
bare_write(1, "\n", 1);
}
return argc;
}
Notice that here several subtleties are ignored - in particular, the atexit bit. All the documentation about the machine-specific startup state has been extracted from the comments in the two glibc files linked above.

This answer is for x86-64, 64-bit Linux ABI, only. All the other OSes and ABIs mentioned will be broadly similar, but different enough in the fine details that you will need to write your custom _start once for each.
You are looking for the specification of the initial process state in the "x86-64 psABI", or, to give it its full title, "System V Application Binary Interface, AMD64 Architecture Processor Supplement (With LP64 and ILP32 Programming Models)". I will reproduce figure 3.9, "Initial Process Stack", here:
Purpose Start Address Length
------------------------------------------------------------------------
Information block, including varies
argument strings, environment
strings, auxiliary information
...
------------------------------------------------------------------------
Null auxiliary vector entry 1 eightbyte
Auxiliary vector entries... 2 eightbytes each
0 eightbyte
Environment pointers... 1 eightbyte each
0 8+8*argc+%rsp eightbyte
Argument pointers... 8+%rsp argc eightbytes
Argument count %rsp eightbyte
It goes on to say that the initial registers are unspecified except
for %rsp, which is of course the stack pointer, and %rdx, which may contain "a function pointer to register with atexit".
So all the information you are looking for is already present in memory, but it hasn't been laid out according to the normal calling convention, which means you must write _start in assembly language. It is _start's responsibility to set everything up to call main with, based on the above. A minimal _start would look something like this:
_start:
xorl %ebp, %ebp # mark the deepest stack frame
# Current Linux doesn't pass an atexit function,
# so you could leave out this part of what the ABI doc says you should do
# You can't just keep the function pointer in a call-preserved register
# and call it manually, even if you know the program won't call exit
# directly, because atexit functions must be called in reverse order
# of registration; this one, if it exists, is meant to be called last.
testq %rdx, %rdx # is there "a function pointer to
je skip_atexit # register with atexit"?
movq %rdx, %rdi # if so, do it
call atexit
skip_atexit:
movq (%rsp), %rdi # load argc
leaq 8(%rsp), %rsi # calc argv (pointer to the array on the stack)
leaq 8(%rsp,%rdi,8), %rdx # calc envp (starts after the NULL terminator for argv[])
call main
movl %eax, %edi # pass return value of main to exit
call exit
hlt # should never get here
(Completely untested.)
(In case you're wondering why there's no adjustment to maintain stack pointer alignment, this is because upon a normal procedure call, 8(%rsp) is 16-byte aligned, but when _start is called, %rsp itself is 16-byte aligned. Each call instruction displaces %rsp down by eight, producing the alignment situation expected by normal compiled functions.)
A more thorough _start would do more things, such as clearing all the other registers, arranging for greater stack pointer alignment than the default if desired, calling into the C library's own initialization functions, setting up environ, initializing the state used by thread-local storage, doing something constructive with the auxiliary vector, etc.
You should also be aware that if there is a dynamic linker (PT_INTERP section in the executable), it receives control before _start does. Glibc's ld.so cannot be used with any C library other than glibc itself; if you are writing your own C library, and you want to support dynamic linkage, you will also need to write your own ld.so. (Yes, this is unfortunate; ideally, the dynamic linker would be a separate development project and its complete interface would be specified.)

As a quick and dirty hack, you can make an executable with a compiled C function as the ELF entry point. Just make sure you use exit or _exit instead of returning.
(Link with gcc -nostartfiles to omit CRT but still link other libraries, and write a _start() in C. Beware of ABI violations like stack alignment, e.g. use -mincoming-stack-boundary=2 or an __attribte__ on _start, as in Compiling without libc)
If it's dynamically linked, you can still use glibc functions on Linux (because the dynamic linker runs glibc's init functions). Not all systems are like this, e.g. on cygwin you definitely can't call libc functions if you (or the CRT start code) hasn't called the libc init functions in the correct order. I'm not sure it's even guaranteed that this works on Linux, so don't depend on it except for experimentation on your own system.
I have used a C _start(void){ ... } + calling _exit() for making a static executable to microbenchmark some compiler-generated code with less startup overhead for perf stat ./a.out.
Glibc's _exit() works even if glibc wasn't initialized (gcc -O3 -static), or use inline asm to run xor %edi,%edi / mov $60, %eax / syscall (sys_exit(0) on Linux) so you don't have to even statically link libc. (gcc -O3 -nostdlib)
With even more dirty hacking and UB, you can access argc and argv by knowing the x86-64 System V ABI that you're compiling for (see #zwol's answer for a quote from ABI doc), and how the process startup state differers from the function calling convention:
argc is where the return address would be for a normal function (pointed to by RSP). GNU C has a builtin for accessing the return address of the current function (or for walking up the stack.)
argv[0] is where the 7th integer/pointer arg should be (the first stack arg, just above the return address). It happens to / seems to work to take its address and use that as an array!
// Works only for the x86-64 SystemV ABI; only tested on Linux.
// DO NOT USE THIS EXCEPT FOR EXPERIMENTS ON YOUR OWN COMPUTER.
#include <stdio.h>
#include <stdlib.h>
// tell gcc *this* function is called with a misaligned RSP
__attribute__((force_align_arg_pointer))
void _start(int dummy1, int dummy2, int dummy3, int dummy4, int dummy5, int dummy6, // register args
char *argv0) {
int argc = (int)(long)__builtin_return_address(0); // load (%rsp), casts to silence gcc warnings.
char **argv = &argv0;
printf("argc = %d, argv[argc-1] = %s\n", argc, argv[argc-1]);
printf("%f\n", 1.234); // segfaults if RSP is misaligned
exit(0);
//_exit(0); // without flushing stdio buffers!
}
# with a version without the FP printf
peter#volta:~/src/SO$ gcc -nostartfiles _start.c -o bare_start
peter#volta:~/src/SO$ ./bare_start
argc = 1, argv[argc-1] = ./bare_start
peter#volta:~/src/SO$ ./bare_start abc def hij
argc = 4, argv[argc-1] = hij
peter#volta:~/src/SO$ file bare_start
bare_start: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=af27c8416b31bb74628ef9eec51a8fc84e49550c, not stripped
# I could have used -fno-pie -no-pie to make a non-PIE executable
This works with or without optimization, with gcc7.3. I was worried that without optimization, the address of argv0 would be below rbp where it copies the arg, rather than its original location. But apparently it works.
gcc -nostartfiles links glibc but not the CRT start files.
gcc -nostdlib omits both libraries and CRT startup files.
Very little of this is guaranteed to work, but it does in practice work with current gcc on current x86-64 Linux, and has worked in the past for years. If it breaks, you get to keep both pieces. IDK what C features are broken by omitting the CRT startup code and just relying on the dynamic linker to run glibc init functions. Also, taking the address of an arg and accessing pointers above it is UB, so you could maybe get broken code-gen. gcc7.3 happens to do what you'd expect in this case.
Things that definitely break
atexit() cleanup, e.g. flushing stdio buffers.
static destructors for static objects in dynamically-linked libraries. (On entry to _start, RDX is a function pointer you should register with atexit for this reason. In a dynamically linked executable, the dynamic linker runs before your _start and sets RDX before jumping to your _start. Statically linked executables have RDX=0 under Linux.)
gcc -mincoming-stack-boundary=3 (i.e. 2^3 = 8 bytes) is another way to get gcc to realign the stack, because the -mpreferred-stack-boundary=4 default of 2^4 = 16 is still in place. But that makes gcc assume under-aligned RSP for all functions, not just for _start, which is why I looked in the docs and found an attribute that was intended for 32-bit when the ABI transitioned from only requiring 4-byte stack alignment to the current requirement of 16-byte alignment for ESP in 32-bit mode.
The SysV ABI requirement for 64-bit mode has always been 16-byte alignment, but gcc options let you make code that doesn't follow the ABI.
// test call to a function the compiler can't inline
// to see if gcc emits extra code to re-align the stack
// like it would if we'd used -mincoming-stack-boundary=3 to assume *all* functions
// have only 8-byte (2^3) aligned RSP on entry, with the default -mpreferred-stack-boundary=4
void foo() {
int i = 0;
atoi(NULL);
}
With -mincoming-stack-boundary=3, we get stack-realignment code there, where we don't need it. gcc's stack-realignment code is pretty clunky, so we'd like to avoid that. (Not that you'd really ever use this to compile a significant program where you care about efficiency, please only use this stupid computer trick as a learning experiment.)
But anyway, see the code on the Godbolt compiler explorer with and without -mpreferred-stack-boundary=3.

Strsep returns a 32 bit pointer on my Mac OS X x86_64 system

Edit: Found solution (see bottom)
I'm trying to build a software package (owfs) on a Mac OS X 10.8 x86_64 system. This software package is mostly developed on/for linux, but I've gotten an old version to compile and function correctly, and there are plenty of posts around the internet of people running various versions of it, so I'm hopeful that it's possible. I've been able to get the compilation process to complete successfully, but the program segfaults when I try to run it. I investigated in GDB, and here is what I found:
The offending piece of code:
void FS_dir_entry_aliased(void (*dirfunc) (void *, const struct parsedname *), void *v, const struct parsedname *pn)
{
if ( ( pn->state & ePS_unaliased ) == 0 ) {
// Want alias substituted
struct parsedname s_pn_copy ;
struct parsedname * pn_copy = & s_pn_copy ;
ASCII path[PATH_MAX+3] ;
ASCII * path_pointer = path ; // current location in original path
// Shallow copy
memcpy( pn_copy, pn, sizeof(struct parsedname) ) ;
pn_copy->path[0] = '\0' ;
// path copy to use for separation
strcpy( path, pn->path ) ;
// copy segments of path (delimitted by "/") to copy
while( path_pointer != NULL ) {
ASCII * path_segment = NULL;
path_segment = strsep( &path_pointer, "/" ) ;
BYTE sn[SERIAL_NUMBER_SIZE] ;
if ( PATH_MAX < strlen(pn_copy->path) + strlen(path_segment) ) {
The call to strlen(path_segment) in the if block fails because path_segment is an out-of-bounds address.
Taking things step by step with GDB, here's what I've found. Going into the call that initializes path_segment with strsep, gdb gives me:
path_pointer = (ASCII *) 0x7fff5fbf99a5 "/uncached/bus.0/interface"
After I execute the next line, path_pointer has advanced, as I would expect:
path_pointer = (ASCII *) 0x7fff5fbf99a6 "uncached/bus.0/interface"
but gdb reports:
path_segment = (ASCII *) 0x5fbf99a5 <Address 0x5fbf99a5 out of bounds>
This is the correct "start" of the address, but it's a 32-bit pointer address. Since my system is a 64-bit system, when strlen gets called on this, I get a EXC_BAD_ACCESS.
I understand there's a lot going on here, and I haven't provided enough info if the problem is something buried deeply in the package build process. It seems to me though that this might have a simple solution, since the function is basically working correctly, but just returning the wrong size pointer. I have very little experience with the intricacies of 64 bit vs. 32 bit systems though, so I was wondering if anyone can spot something obvious, or provide instructions for next steps debugging this issue. Interestingly, when I run the strsep commands by hand in gdb (p (ASCII *) strsep (&path_pointer, "/")), they appear to work fine, giving me 64 bit pointers that match what I expect them to.
Finally, in case it's useful, here's what I believe are the assembly lines for the strsep call:
0x0000000100036eee <FS_dir_entry_aliased+318>: callq 0x1000a56a0 <dyld_stub_strsep>
0x0000000100036ef3 <FS_dir_entry_aliased+323>: mov %eax,%ecx
0x0000000100036ef5 <FS_dir_entry_aliased+325>: movslq %ecx,%rcx
0x0000000100036ef8 <FS_dir_entry_aliased+328>: mov %rcx,-0x1d10(%rbp)
and the disassemble for the strsep that's at that callq address:
0x00007fff915c0fc7 <strsep+0>: push %rbp
0x00007fff915c0fc8 <strsep+1>: mov %rsp,%rbp
0x00007fff915c0fcb <strsep+4>: mov (%rdi),%r8
0x00007fff915c0fce <strsep+7>: xor %eax,%eax
0x00007fff915c0fd0 <strsep+9>: test %r8,%r8
0x00007fff915c0fd3 <strsep+12>: je 0x7fff915c1007 <strsep+64>
0x00007fff915c0fd5 <strsep+14>: mov %r8,%r10
0x00007fff915c0fd8 <strsep+17>: jmp 0x7fff915c0fe4 <strsep+29>
0x00007fff915c0fda <strsep+19>: inc %rdx
0x00007fff915c0fdd <strsep+22>: test %al,%al
0x00007fff915c0fdf <strsep+24>: jne 0x7fff915c0fee <strsep+39>
0x00007fff915c0fe1 <strsep+26>: mov %r9,%r10
0x00007fff915c0fe4 <strsep+29>: mov (%r10),%cl
0x00007fff915c0fe7 <strsep+32>: lea 0x1(%r10),%r9
0x00007fff915c0feb <strsep+36>: mov %rsi,%rdx
0x00007fff915c0fee <strsep+39>: mov (%rdx),%al
0x00007fff915c0ff0 <strsep+41>: cmp %cl,%al
0x00007fff915c0ff2 <strsep+43>: jne 0x7fff915c0fda <strsep+19>
0x00007fff915c0ff4 <strsep+45>: xor %edx,%edx
0x00007fff915c0ff6 <strsep+47>: test %cl,%cl
0x00007fff915c0ff8 <strsep+49>: je 0x7fff915c1001 <strsep+58>
0x00007fff915c0ffa <strsep+51>: movb $0x0,(%r10)
0x00007fff915c0ffe <strsep+55>: mov %r9,%rdx
0x00007fff915c1001 <strsep+58>: mov %rdx,(%rdi)
0x00007fff915c1004 <strsep+61>: mov %r8,%rax
0x00007fff915c1007 <strsep+64>: pop %rbp
0x00007fff915c1008 <strsep+65>: retq
Edit:
string.h is definitely being included, there's a system-wide include file that includes it, but just to make sure I tried adding it this specific file. -D_BSD_SOURCE=1 is one of the compiler options added automatically, so that is happening. There is also a compiler option -D_ISOC99_SOURCE=1 by default. I tried adding -std=gnu99 (my compiler, llvm-gcc 4.2, doesn't like -std=gnu90), but it did not fix it. I'm getting the following compiler errors:
Here are the compiler warnings that I get for this file:
ow_alias.c: In function 'ReadAliasFile':
ow_alias.c:40: warning: implicit declaration of function 'getline'
ow_alias.c:48: warning: implicit declaration of function 'strsep'
ow_alias.c:48: warning: assignment makes pointer from integer without a cast
ow_alias.c:64: warning: assignment makes pointer from integer without a cast
ow_alias.c: In function 'FS_dir_entry_aliased':
ow_alias.c:177: warning: initialization makes pointer from integer without a cast
Second Edit:
After doing a little more digging (thanks for the gcc -E | grep strsep tip!), I found the problem was with a compiler flag -D_POSIX_C_SOURCE=200112L. This compiler flag was blocking the definition of strsep, making the compiler make an implicit declaration. I removed this from the configure script and everything seems to be working fine. Thanks for the help!

You need to make sure that string.h is included.
If that's already being done, you'll need to also define a feature test macro to enable the declaration of strsep(). Adding -D_GNU_SOURCE or -D_BSD_SOURCE should do the trick. However, you might want to look into how you have things configured right now - _BSD_SOURCE is usually enabled by default. Perhaps you're telling gcc to compile with -std=c90 or -std=c99, which will turn off many feature test macros. Use -std=gnu90 or -std=gnu99 instead.
Are you seeing warnings such as "initialization makes pointer from integer without a cast" or "implicit declaration of function"?

How to get c code to execute hex machine code?

I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writting this simple assembly program (it's not supposed to really do anything)
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?

Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache (sum, sum + sizeof(sum)); // GNU C
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.

Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a separate bit for execute separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c hence the addresses near 401000
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on #AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination into just the 2nd store after the call. This results in no store before the first call so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4095ULL; // round down
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+2) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).

You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Your machine code may be all right, but your CPU objects.
Modern CPUs manage memory in segments. In normal operation, the operating system loads a new program into a program-text segment and sets up a stack in a data segment. The operating system tells the CPU never to run code in a data segment. Your code is in code[], in a data segment. Thus the segfault.

This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.

You may need to set the page executable before you may call it.
On MS-Windows, see the VirtualProtect -function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx

Sorry, I couldn't follow above examples which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you could use asm and .word keywords to place your instructions in hex format.
See below example:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define ".word 0x00010001 \n"
Basically, c.nop instruction was not supported by my current assembler. So, I defined CNOP as the hex equivalent of c.nop with proper syntax and used inside asm, with which I was aware of.
.rept <NUM> .endr will basically, repeat the instruction NUM times.
This solution is working and verified.

gcc, strict-aliasing, and horror stories [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
In gcc-strict-aliasing-and-casting-through-a-union I asked whether anyone had encountered problems with union punning through pointers. So far, the answer seems to be No.
This question is broader: Do you have any horror stories about gcc and strict-aliasing?
Background: Quoting from AndreyT's answer in c99-strict-aliasing-rules-in-c-gcc:
"Strict aliasing rules are rooted in parts of the standard that were present in C and C++ since the beginning of [standardized] times. The clause that prohibits accessing object of one type through a lvalue of another type is present in C89/90 (6.3) as well as in C++98 (3.10/15). ... It is just that not all compilers wanted (or dared) to enforce it or rely on it."
Well, gcc is now daring to do so, with its -fstrict-aliasing switch. And this has caused some problems. See, for example, the excellent article http://davmac.wordpress.com/2009/10/ about a Mysql bug, and the equally excellent discussion in http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html.
Some other less-relevant links:
performance-impact-of-fno-strict-aliasing
strict-aliasing
when-is-char-safe-for-strict-pointer-aliasing
how-to-detect-strict-aliasing-at-compile-time
So to repeat, do you have a horror story of your own? Problems not indicated by -Wstrict-aliasing would, of course, be preferred. And other C compilers are also welcome.
Added June 2nd: The first link in Michael Burr's answer, which does indeed qualify as a horror story, is perhaps a bit dated (from 2003). I did a quick test, but the problem has apparently gone away.
Source:
#include <string.h>
struct iw_event { /* dummy! */
int len;
};
char *iwe_stream_add_event(
char *stream, /* Stream of events */
char *ends, /* End of stream */
struct iw_event *iwe, /* Payload */
int event_len) /* Real size of payload */
{
/* Check if it's possible */
if ((stream + event_len) < ends) {
iwe->len = event_len;
memcpy(stream, (char *) iwe, event_len);
stream += event_len;
}
return stream;
}
The specific complaint is:
Some users have complained that when the [above] code is compiled without the -fno-strict-aliasing, the order of the write and memcpy is inverted (which means a bogus len is mem-copied into the stream).
Compiled code, using gcc 4.3.4 on CYGWIN wih -O3 (please correct me if I am wrong--my assembler is a bit rusty!):
_iwe_stream_add_event:
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $20, %esp
movl 8(%ebp), %eax # stream --> %eax
movl 20(%ebp), %edx # event_len --> %edx
leal (%eax,%edx), %ebx # sum --> %ebx
cmpl 12(%ebp), %ebx # compare sum with ends
jae L2
movl 16(%ebp), %ecx # iwe --> %ecx
movl %edx, (%ecx) # event_len --> iwe->len (!!)
movl %edx, 8(%esp) # event_len --> stack
movl %ecx, 4(%esp) # iwe --> stack
movl %eax, (%esp) # stream --> stack
call _memcpy
movl %ebx, %eax # sum --> retval
L2:
addl $20, %esp
popl %ebx
leave
ret
And for the second link in Michael's answer,
*(unsigned short *)&a = 4;
gcc will usually (always?) give a warning. But I believe a valid solution to this (for gcc) is to use:
#define CAST(type, x) (((union {typeof(x) src; type dst;}*)&(x))->dst)
// ...
CAST(unsigned short, a) = 4;
I've asked SO whether this is OK in gcc-strict-aliasing-and-casting-through-a-union, but so far nobody disagrees.

No horror story of my own, but here are some quotes from Linus Torvalds (sorry if these are already in one of the linked references in the question):
http://lkml.org/lkml/2003/2/26/158:
Date Wed, 26 Feb 2003 09:22:15 -0800
Subject Re: Invalid compilation without -fno-strict-aliasing
From Jean Tourrilhes <>
On Wed, Feb 26, 2003 at 04:38:10PM +0100, Horst von Brand wrote:
Jean Tourrilhes <> said:
It looks like a compiler bug to me...
Some users have complained that when the following code is
compiled without the -fno-strict-aliasing, the order of the write and
memcpy is inverted (which mean a bogus len is mem-copied into the
stream).
Code (from linux/include/net/iw_handler.h) :
static inline char *
iwe_stream_add_event(char * stream, /* Stream of events */
char * ends, /* End of stream */
struct iw_event *iwe, /* Payload */
int event_len) /* Real size of payload */
{
/* Check if it's possible */
if((stream + event_len) < ends) {
iwe->len = event_len;
memcpy(stream, (char *) iwe, event_len);
stream += event_len;
}
return stream;
}
IMHO, the compiler should have enough context to know that the
reordering is dangerous. Any suggestion to make this simple code more
bullet proof is welcomed.
The compiler is free to assume char *stream and struct iw_event *iwe point
to separate areas of memory, due to strict aliasing.
Which is true and which is not the problem I'm complaining about.
(Note with hindsight: this code is fine, but Linux's implementation of memcpy was a macro that cast to long * to copy in larger chunks. With a correctly-defined memcpy, gcc -fstrict-aliasing isn't allowed to break this code. But it means you need inline asm to define a kernel memcpy if your compiler doesn't know how turn a byte-copy loop into efficient asm, which was the case for gcc before gcc7)
And Linus Torvald's comment on the above:
Jean Tourrilhes wrote:
>
It looks like a compiler bug to me...
Why do you think the kernel uses "-fno-strict-aliasing"?
The gcc people are more interested in trying to find out what can be
allowed by the c99 specs than about making things actually work. The
aliasing code in particular is not even worth enabling, it's just not
possible to sanely tell gcc when some things can alias.
Some users have complained that when the following code is
compiled without the -fno-strict-aliasing, the order of the write and
memcpy is inverted (which mean a bogus len is mem-copied into the
stream).
The "problem" is that we inline the memcpy(), at which point gcc won't
care about the fact that it can alias, so they'll just re-order
everything and claim it's out own fault. Even though there is no sane
way for us to even tell gcc about it.
I tried to get a sane way a few years ago, and the gcc developers really
didn't care about the real world in this area. I'd be surprised if that
had changed, judging by the replies I have already seen.
I'm not going to bother to fight it.
Linus
http://www.mail-archive.com/linux-btrfs#vger.kernel.org/msg01647.html:
Type-based aliasing is stupid. It's so incredibly stupid that it's not even funny. It's broken. And gcc took the broken notion, and made it more so by making it a "by-the-letter-of-the-law" thing that makes no sense.
...
I know for a fact that gcc would re-order write accesses that were clearly to (statically) the same address. Gcc would suddenly think that
unsigned long a;
a = 5;
*(unsigned short *)&a = 4;
could be re-ordered to set it to 4 first (because clearly they don't alias - by reading the standard), and then because now the assignment of 'a=5' was later, the assignment of 4 could be elided entirely! And if somebody complains that the compiler is insane, the compiler people would say "nyaah, nyaah, the standards people said we can do this", with absolutely no introspection to ask whether it made any SENSE.

SWIG generates code that depends on strict aliasing being off, which can cause all sorts of problems.
SWIGEXPORT jlong JNICALL Java_com_mylibJNI_make_1mystruct_1_1SWIG_12(
JNIEnv *jenv, jclass jcls, jint jarg1, jint jarg2) {
jlong jresult = 0 ;
int arg1 ;
int arg2 ;
my_struct_t *result = 0 ;
(void)jenv;
(void)jcls;
arg1 = (int)jarg1;
arg2 = (int)jarg2;
result = (my_struct_t *)make_my_struct(arg1,arg2);
*(my_struct_t **)&jresult = result; /* <<<< horror*/
return jresult;
}

gcc, aliasing, and 2-D variable-length arrays: The following sample code copies a 2x2 matrix:
#include <stdio.h>
static void copy(int n, int a[][n], int b[][n]) {
int i, j;
for (i = 0; i < 2; i++) // 'n' not used in this example
for (j = 0; j < 2; j++) // 'n' hard-coded to 2 for simplicity
b[i][j] = a[i][j];
}
int main(int argc, char *argv[]) {
int a[2][2] = {{1, 2},{3, 4}};
int b[2][2];
copy(2, a, b);
printf("%d %d %d %d\n", b[0][0], b[0][1], b[1][0], b[1][1]);
return 0;
}
With gcc 4.1.2 on CentOS, I get:
$ gcc -O1 test.c && a.out
1 2 3 4
$ gcc -O2 test.c && a.out
10235717 -1075970308 -1075970456 11452404 (random)
I don't know whether this is generally known, and I don't know whether this a bug or a feature. I can't duplicate the problem with gcc 4.3.4 on Cygwin, so it may have been fixed. Some work-arounds:
Use __attribute__((noinline)) for copy().
Use the gcc switch -fno-strict-aliasing.
Change the third parameter of copy() from b[][n] to b[][2].
Don't use -O2 or -O3.
Further notes:
This is an answer, after a year and a day, to my own question (and I'm a bit surprised there are only two other answers).
I lost several hours with this on my actual code, a Kalman filter. Seemingly small changes would have drastic effects, perhaps because of changing gcc's automatic inlining (this is a guess; I'm still uncertain). But it probably doesn't qualify as a horror story.
Yes, I know you wouldn't write copy() like this. (And, as an aside, I was slightly surprised to see gcc did not unroll the double-loop.)
No gcc warning switches, include -Wstrict-aliasing=, did anything here.
1-D variable-length arrays seem to be OK.
Update: The above does not really answer the OP's question, since he (i.e. I) was asking about cases where strict aliasing 'legitimately' broke your code, whereas the above just seems to be a garden-variety compiler bug.
I reported it to GCC Bugzilla, but they weren't interested in the old 4.1.2, even though (I believe) it is the key to the $1-billion RHEL5. It doesn't occur in 4.2.4 up.
And I have a slightly simpler example of a similar bug, with only one matrix. The code:
static void zero(int n, int a[][n]) {
int i, j;
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
a[i][j] = 0;
}
int main(void) {
int a[2][2] = {{1, 2},{3, 4}};
zero(2, a);
printf("%d\n", a[1][1]);
return 0;
}
produces the results:
gcc -O1 test.c && a.out
0
gcc -O1 -fstrict-aliasing test.c && a.out
4
It seems it is the combination -fstrict-aliasing with -finline which causes the bug.

here is mine:
http://forum.openscad.org/CGAL-3-6-1-causing-errors-but-CGAL-3-6-0-OK-tt2050.html
it caused certain shapes in a CAD program to be drawn incorrectly. thank goodness for the project's leaders work on creating a regression test suite.
the bug only manifested itself on certain platforms, with older versions of GCC and older versions of certain libraries. and then only with -O2 turned on. -fno-strict-aliasing solved it.

The Common Initial Sequence rule of C used to be interpreted as making it
possible to write a function which could work on the leading portion of a
wide variety of structure types, provided they start with elements of matching
types. Under C99, the rule was changed so that it only applied if the structure
types involved were members of the same union whose complete declaration was visible at the point of use.
The authors of gcc insist that the language in question is only applicable if
the accesses are performed through the union type, notwithstanding the facts
that:
There would be no reason to specify that the complete declaration must be visible if accesses had to be performed through the union type.
Although the CIS rule was described in terms of unions, its primary
usefulness lay in what it implied about the way in which structs were
laid out and accessed. If S1 and S2 were structures that shared a CIS,
there would be no way that a function that accepted a pointer to an S1
and an S2 from an outside source could comply with C89's CIS rules
without allowing the same behavior to be useful with pointers to
structures that weren't actually inside a union object; specifying CIS
support for structures would thus have been redundant given that it was
already specified for unions.

The following code returns 10, under gcc 4.4.4. Is anything wrong with the union method or gcc 4.4.4?
int main()
{
int v = 10;
union vv {
int v;
short q;
} *s = (union vv *)&v;
s->v = 1;
return v;
}