GCC doesn't use built in atomics with -fopenmp [duplicate] - c

I have stumbled upon the following problem. The below code snippet does not link on Mac OS X with any Xcode I tried (4.4, 4.5)
#include <stdlib.h>
#include <string.h>
#include <emmintrin.h>
int main(int argc, char *argv[])
{
char *temp;
#pragma omp parallel
{
__m128d v_a, v_ar;
memcpy(temp, argv[0], 10);
v_ar = _mm_shuffle_pd(v_a, v_a, _MM_SHUFFLE2 (0,1));
}
}
The code is just provided as an example and would segfault when you run it. The point is that it does not compile. The compilation is done using the following line
/Applications/Xcode.app/Contents/Developer/usr/bin/gcc test.c -arch x86_64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.7.sdk -mmacosx-version-min=10.7 -fopenmp
Undefined symbols for architecture x86_64:
"___builtin_ia32_shufpd", referenced from:
_main.omp_fn.0 in ccJM7RAw.o
"___builtin_object_size", referenced from:
_main.omp_fn.0 in ccJM7RAw.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
The code compiles just fine when not using the -fopenmp flag to gcc. Now, I googled around and found a solution for the first problem connected with memcpy, which is adding -fno-builtin, or -D_FORTIFY_SOURCE=0 to gcc arguments list. I did not manage to solve the second problem (sse intrinsic).
Can anyone help me to solve this? The questions:
most importantly: how to get rid of the "___builtin_ia32_shufpd" error?
what exactly is the reason for the memcpy problem, and what does the -D_FORTIFY_SOURCE=0 flag eventually do?

This is a bug in the way Apple's LLVM-backed GCC (llvm-gcc) transforms OpenMP regions and handles calls to the built-ins inside them. The problem can be diagnosed by examining the intermediate tree dumps (obtainable by passing -fdump-tree-all argument to gcc). Without OpenMP enabled the following final code representation is generated (from the test.c.016t.fap):
main (argc, argv)
{
D.6544 = __builtin_object_size (temp, 0);
D.6545 = __builtin_object_size (temp, 0);
D.6547 = __builtin___memcpy_chk (temp, D.6546, 10, D.6545);
D.6550 = __builtin_ia32_shufpd (v_a, v_a, 1);
}
This is a C-like representation of how the compiler sees the code internally after all transformations. This is what is then gets turned into assembly instructions. (only those lines that refer to the built-ins are shown here)
With OpenMP enabled the parallel region is extracted into own function, main.omp_fn.0:
main.omp_fn.0 (.omp_data_i)
{
void * (*<T4f6>) (void *, const <unnamed type> *, long unsigned int, long unsigned int) __builtin___memcpy_chk.21;
long unsigned int (*<T4f5>) (const <unnamed type> *, int) __builtin_object_size.20;
vector double (*<T6b5>) (vector double, vector double, int) __builtin_ia32_shufpd.23;
long unsigned int (*<T4f5>) (const <unnamed type> *, int) __builtin_object_size.19;
__builtin_object_size.19 = __builtin_object_size;
D.6587 = __builtin_object_size.19 (D.6603, 0);
__builtin_ia32_shufpd.23 = __builtin_ia32_shufpd;
D.6593 = __builtin_ia32_shufpd.23 (v_a, v_a, 1);
__builtin_object_size.20 = __builtin_object_size;
D.6588 = __builtin_object_size.20 (D.6605, 0);
__builtin___memcpy_chk.21 = __builtin___memcpy_chk;
D.6590 = __builtin___memcpy_chk.21 (D.6609, D.6589, 10, D.6588);
}
Again I have only left the code that refers to the builtins. What is apparent (but the reason for that is not immediately apparent to me) is that the OpenMP code trasnformer really insists on calling all the built-ins through function pointers. These pointer asignments:
__builtin_object_size.19 = __builtin_object_size;
__builtin_ia32_shufpd.23 = __builtin_ia32_shufpd;
__builtin_object_size.20 = __builtin_object_size;
__builtin___memcpy_chk.21 = __builtin___memcpy_chk;
generate external references to symbols which are not really symbols but rather names that get special treatment by the compiler. The linker then tries to resolve them but is unable to find any of the __builtin_* names in any of the object files that the code is linked against. This is also observable in the assembly code that one can obtain by passing -S to gcc:
LBB2_1:
movapd -48(%rbp), %xmm0
movl $1, %eax
movaps %xmm0, -80(%rbp)
movaps -80(%rbp), %xmm1
movl %eax, %edi
callq ___builtin_ia32_shufpd
movapd %xmm0, -32(%rbp)
This basically is a function call that takes 3 arguments: one integer in %eax and two XMM arguments in %xmm0 and %xmm1, with the result being returned in %xmm0 (as per the SysV AMD64 ABI function calling convention). In contrast, the code generated without -fopenmp is an instruction-level expansion of the intrinsic as it is supposed to happen:
LBB1_3:
movapd -64(%rbp), %xmm0
shufpd $1, %xmm0, %xmm0
movapd %xmm0, -80(%rbp)
What happens when you pass -D_FORTIFY_SOURCE=0 is that memcpy is not replaced by the "fortified" checking version and a regular call to memcpy is used instead. This eliminates the references to object_size and __memcpy_chk but cannot remove the call to the ia32_shufpd built-in.
This is obviously a compiler bug. If you really really really must use Apple's GCC to compile the code, then an interim solution would be to move the offending code to an external function as the bug apparently only affects code that gets extracted from parallel regions:
void func(char *temp, char *argv0)
{
__m128d v_a, v_ar;
memcpy(temp, argv0, 10);
v_ar = _mm_shuffle_pd(v_a, v_a, _MM_SHUFFLE2 (0,1));
}
int main(int argc, char *argv[])
{
char *temp;
#pragma omp parallel
{
func(temp, argv[0]);
}
}
The overhead of one additional function call is neglegible compared to the overhead of entering and exiting the parallel region. You can use OpenMP pragmas inside func - they will work because of the dynamic scoping of the parallel region.
May be Apple would provide a fixed compiler in the future, may they won't, given their commitment to replacing GCC with Clang.

Related

How Get arguments value using inline assembly in C without Glibc?

How Get arguments value using inline assembly in C without Glibc?
i require this code for Linux archecture x86_64 and i386.
if you know about MAC OS X or Windows , also submit and please guide.
void exit(int code)
{
//This function not important!
//...
}
void _start()
{
//How Get arguments value using inline assembly
//in C without Glibc?
//argc
//argv
exit(0);
}
New Update
https://gist.github.com/apsun/deccca33244471c1849d29cc6bb5c78e
and
#define ReadRdi(To) asm("movq %%rdi,%0" : "=r"(To));
#define ReadRsi(To) asm("movq %%rsi,%0" : "=r"(To));
long argcL;
long argvL;
ReadRdi(argcL);
ReadRsi(argvL);
int argc = (int) argcL;
//char **argv = (char **) argvL;
exit(argc);
But it still returns 0.
So this code is wrong!
please help.
As specified in the comment, argc and argv are provided on the stack, so you cannot use a regular C function to get them, even with inline assembly, as the compiler will touch the stack pointer to allocate the local variables, setup the stack frame & co.; hence, _start must be written in assembly, as it's done in glibc (x86; x86_64). A small stub can be written to just grab the stuff and forward it to your "real" C entrypoint according to the regular calling convention.
Here a minimal example of a program (both for x86 and x86_64) that reads argc and argv, prints all the values in argv on stdout (separated by newline) and exits using argc as status code; it can be compiled with the usual gcc -nostdlib (and -static to make sure ld.so isn't involved; not that it does any harm here).
#ifdef __x86_64__
asm(
".global _start\n"
"_start:\n"
" xorl %ebp,%ebp\n" // mark outermost stack frame
" movq 0(%rsp),%rdi\n" // get argc
" lea 8(%rsp),%rsi\n" // the arguments are pushed just below, so argv = %rbp + 8
" call bare_main\n" // call our bare_main
" movq %rax,%rdi\n" // take the main return code and use it as first argument for...
" movl $60,%eax\n" // ... the exit syscall
" syscall\n"
" int3\n"); // just in case
asm(
"bare_write:\n" // write syscall wrapper; the calling convention is pretty much ok as is
" movq $1,%rax\n" // 1 = write syscall on x86_64
" syscall\n"
" ret\n");
#endif
#ifdef __i386__
asm(
".global _start\n"
"_start:\n"
" xorl %ebp,%ebp\n" // mark outermost stack frame
" movl 0(%esp),%edi\n" // argc is on the top of the stack
" lea 4(%esp),%esi\n" // as above, but with 4-byte pointers
" sub $8,%esp\n" // the start starts 16-byte aligned, we have to push 2*4 bytes; "waste" 8 bytes
" pushl %esi\n" // to keep it aligned after pushing our arguments
" pushl %edi\n"
" call bare_main\n" // call our bare_main
" add $8,%esp\n" // fix the stack after call (actually useless here)
" movl %eax,%ebx\n" // take the main return code and use it as first argument for...
" movl $1,%eax\n" // ... the exit syscall
" int $0x80\n"
" int3\n"); // just in case
asm(
"bare_write:\n" // write syscall wrapper; convert the user-mode calling convention to the syscall convention
" pushl %ebx\n" // ebx is callee-preserved
" movl 8(%esp),%ebx\n" // just move stuff from the stack to the correct registers
" movl 12(%esp),%ecx\n"
" movl 16(%esp),%edx\n"
" mov $4,%eax\n" // 4 = write syscall on i386
" int $0x80\n"
" popl %ebx\n" // restore ebx
" ret\n"); // notice: the return value is already ok in %eax
#endif
int bare_write(int fd, const void *buf, unsigned count);
unsigned my_strlen(const char *ch) {
const char *ptr;
for(ptr = ch; *ptr; ++ptr);
return ptr-ch;
}
int bare_main(int argc, char *argv[]) {
for(int i = 0; i < argc; ++i) {
int len = my_strlen(argv[i]);
bare_write(1, argv[i], len);
bare_write(1, "\n", 1);
}
return argc;
}
Notice that here several subtleties are ignored - in particular, the atexit bit. All the documentation about the machine-specific startup state has been extracted from the comments in the two glibc files linked above.
This answer is for x86-64, 64-bit Linux ABI, only. All the other OSes and ABIs mentioned will be broadly similar, but different enough in the fine details that you will need to write your custom _start once for each.
You are looking for the specification of the initial process state in the "x86-64 psABI", or, to give it its full title, "System V Application Binary Interface, AMD64 Architecture Processor Supplement (With LP64 and ILP32 Programming Models)". I will reproduce figure 3.9, "Initial Process Stack", here:
Purpose Start Address Length
------------------------------------------------------------------------
Information block, including varies
argument strings, environment
strings, auxiliary information
...
------------------------------------------------------------------------
Null auxiliary vector entry 1 eightbyte
Auxiliary vector entries... 2 eightbytes each
0 eightbyte
Environment pointers... 1 eightbyte each
0 8+8*argc+%rsp eightbyte
Argument pointers... 8+%rsp argc eightbytes
Argument count %rsp eightbyte
It goes on to say that the initial registers are unspecified except
for %rsp, which is of course the stack pointer, and %rdx, which may contain "a function pointer to register with atexit".
So all the information you are looking for is already present in memory, but it hasn't been laid out according to the normal calling convention, which means you must write _start in assembly language. It is _start's responsibility to set everything up to call main with, based on the above. A minimal _start would look something like this:
_start:
xorl %ebp, %ebp # mark the deepest stack frame
# Current Linux doesn't pass an atexit function,
# so you could leave out this part of what the ABI doc says you should do
# You can't just keep the function pointer in a call-preserved register
# and call it manually, even if you know the program won't call exit
# directly, because atexit functions must be called in reverse order
# of registration; this one, if it exists, is meant to be called last.
testq %rdx, %rdx # is there "a function pointer to
je skip_atexit # register with atexit"?
movq %rdx, %rdi # if so, do it
call atexit
skip_atexit:
movq (%rsp), %rdi # load argc
leaq 8(%rsp), %rsi # calc argv (pointer to the array on the stack)
leaq 8(%rsp,%rdi,8), %rdx # calc envp (starts after the NULL terminator for argv[])
call main
movl %eax, %edi # pass return value of main to exit
call exit
hlt # should never get here
(Completely untested.)
(In case you're wondering why there's no adjustment to maintain stack pointer alignment, this is because upon a normal procedure call, 8(%rsp) is 16-byte aligned, but when _start is called, %rsp itself is 16-byte aligned. Each call instruction displaces %rsp down by eight, producing the alignment situation expected by normal compiled functions.)
A more thorough _start would do more things, such as clearing all the other registers, arranging for greater stack pointer alignment than the default if desired, calling into the C library's own initialization functions, setting up environ, initializing the state used by thread-local storage, doing something constructive with the auxiliary vector, etc.
You should also be aware that if there is a dynamic linker (PT_INTERP section in the executable), it receives control before _start does. Glibc's ld.so cannot be used with any C library other than glibc itself; if you are writing your own C library, and you want to support dynamic linkage, you will also need to write your own ld.so. (Yes, this is unfortunate; ideally, the dynamic linker would be a separate development project and its complete interface would be specified.)
As a quick and dirty hack, you can make an executable with a compiled C function as the ELF entry point. Just make sure you use exit or _exit instead of returning.
(Link with gcc -nostartfiles to omit CRT but still link other libraries, and write a _start() in C. Beware of ABI violations like stack alignment, e.g. use -mincoming-stack-boundary=2 or an __attribte__ on _start, as in Compiling without libc)
If it's dynamically linked, you can still use glibc functions on Linux (because the dynamic linker runs glibc's init functions). Not all systems are like this, e.g. on cygwin you definitely can't call libc functions if you (or the CRT start code) hasn't called the libc init functions in the correct order. I'm not sure it's even guaranteed that this works on Linux, so don't depend on it except for experimentation on your own system.
I have used a C _start(void){ ... } + calling _exit() for making a static executable to microbenchmark some compiler-generated code with less startup overhead for perf stat ./a.out.
Glibc's _exit() works even if glibc wasn't initialized (gcc -O3 -static), or use inline asm to run xor %edi,%edi / mov $60, %eax / syscall (sys_exit(0) on Linux) so you don't have to even statically link libc. (gcc -O3 -nostdlib)
With even more dirty hacking and UB, you can access argc and argv by knowing the x86-64 System V ABI that you're compiling for (see #zwol's answer for a quote from ABI doc), and how the process startup state differers from the function calling convention:
argc is where the return address would be for a normal function (pointed to by RSP). GNU C has a builtin for accessing the return address of the current function (or for walking up the stack.)
argv[0] is where the 7th integer/pointer arg should be (the first stack arg, just above the return address). It happens to / seems to work to take its address and use that as an array!
// Works only for the x86-64 SystemV ABI; only tested on Linux.
// DO NOT USE THIS EXCEPT FOR EXPERIMENTS ON YOUR OWN COMPUTER.
#include <stdio.h>
#include <stdlib.h>
// tell gcc *this* function is called with a misaligned RSP
__attribute__((force_align_arg_pointer))
void _start(int dummy1, int dummy2, int dummy3, int dummy4, int dummy5, int dummy6, // register args
char *argv0) {
int argc = (int)(long)__builtin_return_address(0); // load (%rsp), casts to silence gcc warnings.
char **argv = &argv0;
printf("argc = %d, argv[argc-1] = %s\n", argc, argv[argc-1]);
printf("%f\n", 1.234); // segfaults if RSP is misaligned
exit(0);
//_exit(0); // without flushing stdio buffers!
}
# with a version without the FP printf
peter#volta:~/src/SO$ gcc -nostartfiles _start.c -o bare_start
peter#volta:~/src/SO$ ./bare_start
argc = 1, argv[argc-1] = ./bare_start
peter#volta:~/src/SO$ ./bare_start abc def hij
argc = 4, argv[argc-1] = hij
peter#volta:~/src/SO$ file bare_start
bare_start: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=af27c8416b31bb74628ef9eec51a8fc84e49550c, not stripped
# I could have used -fno-pie -no-pie to make a non-PIE executable
This works with or without optimization, with gcc7.3. I was worried that without optimization, the address of argv0 would be below rbp where it copies the arg, rather than its original location. But apparently it works.
gcc -nostartfiles links glibc but not the CRT start files.
gcc -nostdlib omits both libraries and CRT startup files.
Very little of this is guaranteed to work, but it does in practice work with current gcc on current x86-64 Linux, and has worked in the past for years. If it breaks, you get to keep both pieces. IDK what C features are broken by omitting the CRT startup code and just relying on the dynamic linker to run glibc init functions. Also, taking the address of an arg and accessing pointers above it is UB, so you could maybe get broken code-gen. gcc7.3 happens to do what you'd expect in this case.
Things that definitely break
atexit() cleanup, e.g. flushing stdio buffers.
static destructors for static objects in dynamically-linked libraries. (On entry to _start, RDX is a function pointer you should register with atexit for this reason. In a dynamically linked executable, the dynamic linker runs before your _start and sets RDX before jumping to your _start. Statically linked executables have RDX=0 under Linux.)
gcc -mincoming-stack-boundary=3 (i.e. 2^3 = 8 bytes) is another way to get gcc to realign the stack, because the -mpreferred-stack-boundary=4 default of 2^4 = 16 is still in place. But that makes gcc assume under-aligned RSP for all functions, not just for _start, which is why I looked in the docs and found an attribute that was intended for 32-bit when the ABI transitioned from only requiring 4-byte stack alignment to the current requirement of 16-byte alignment for ESP in 32-bit mode.
The SysV ABI requirement for 64-bit mode has always been 16-byte alignment, but gcc options let you make code that doesn't follow the ABI.
// test call to a function the compiler can't inline
// to see if gcc emits extra code to re-align the stack
// like it would if we'd used -mincoming-stack-boundary=3 to assume *all* functions
// have only 8-byte (2^3) aligned RSP on entry, with the default -mpreferred-stack-boundary=4
void foo() {
int i = 0;
atoi(NULL);
}
With -mincoming-stack-boundary=3, we get stack-realignment code there, where we don't need it. gcc's stack-realignment code is pretty clunky, so we'd like to avoid that. (Not that you'd really ever use this to compile a significant program where you care about efficiency, please only use this stupid computer trick as a learning experiment.)
But anyway, see the code on the Godbolt compiler explorer with and without -mpreferred-stack-boundary=3.

How to optimize "don't care" argument with gcc?

Sometimes a function doesn't use an argument (perhaps because another "flags" argument doesn't enable a specific feature).
However, you have to specify something, so usually you just put 0. But if you do that, and the function is external, gcc will emit code to "really make sure" that parameter gets set to 0.
Is there a way to tell gcc that a particular argument to a function doesn't matter and it can leave alone whatever value it is that happens to be in the argument register right then?
Update: Someone asked about the XY problem. The context behind this question is I want to implement a varargs function in x86_64 without using the compiler varargs support. This is simplest when the parameters are on the stack, so I declare my functions to take 5 or 6 dummy parameters first, so that the last non-vararg parameter and all of the vararg parameters end up on the stack. This works fine, except it's clearly not optimal - when looking at the assembly code it's clear that gcc is initializing all those argument registers to zero in the caller.
Please don't take below answer seriously. The question asks for a hack so there you go.
GCC will effectively treat value of uninitialized variable as "don't care" so we can try exploiting this:
int foo(int x, int y);
int bar_1(int y) {
int tmp = tmp; // Suppress uninitialized warnings
return foo(tmp, y);
}
Unfortunately my version of GCC still cowardly initializes tmp to zero but yours may be more aggressive:
bar_1:
.LFB0:
.cfi_startproc
movl %edi, %esi
xorl %edi, %edi
jmp foo
.cfi_endproc
Another option is (ab)using inline assembly to fake GCC into thinking that tmp is defined (when in fact it isn't):
int bar_2(int y) {
int tmp;
asm("" : "=r"(tmp));
return foo(tmp, y);
}
With this GCC managed to get rid of parameter initializations:
bar_2:
.LFB1:
.cfi_startproc
movl %edi, %esi
jmp foo
.cfi_endproc
Note that inline asm must be immediately before the function call, otherwise GCC will think it has to preserve output values which would harm register allocation.

Can I force gcc/clang to emit a function call, even when optimising?

I have some code that runs x64 programs under ptrace and manipulates their code. For testing this sort of thing, I have a placeholder function in one of my test programs:
uint32_t __attribute__((noinline)) func(void) {
return 0xCCCCCCCC;
}
The 0xCCCCCCCC is then replaced at runtime with some other constant.
When I compile with clang and optimisation, though, clang replaces calls to the function with the constant 0xCCCCCCCC (since the function is so simple). For example, here's part of the code generated for the call to check that the call to modified TestFunc() returns 0xCDCDCDCD, which it should after having been modified:
movl $3435973836, %edi ## imm = 0xCCCCCCCC
movl $3435973836, %eax ## imm = 0xCCCCCCCC
leaq 16843009(%rax), %rdx ## [16843009 = 0x01010101]
As you might imagine, this test then fails, because these two mismatching constant values won't match!
Can I force clang (and, hopefully in the same way, gcc...) to generate the function call?
(I don't want to disable optimisations globally - I already have a build that does that (and with that build, all tests pass). I'd just like to do a spot fix, if that's possible.)
Use attribute optnone or optimize to set the optimization level for func to -O0.
Only optnone is defined on my machine. So the code for me is the following, but yours might use the header __attribute__ ((optimize("0"))).
#include <stdlib.h>
uint32_t __attribute__ ((optnone)) func(void) {
return 0xCCCCCCCC;
}
int main()
{
return func();
}
See In clang, how do you use per-function optimization attributes?.

incompatible return type from struct function - C

When I attempt to run this code as it is, I receive the compiler message "error: incompatible types in return". I marked the location of the error in my code. If I take the line out, then the compiler is happy.
The problem is I want to return a value representing invalid input to the function (which in this case is calling f2(2).) I only want a struct returned with data if the function is called without using 2 as a parameter.
I feel the only two ways to go is to either:
make the function return a struct pointer instead of a dead-on struct but then my caller function will look funny as I have to change y.b to y->b and the operation may be slower due to the extra step of fetching data in memory.
Allocate extra memory, zero-byte fill it, and set the return value to the struct in that location in memory. (example: return x[nnn]; instead of return x[0];). This approach will use more memory and some processing to zero-byte fill it.
Ultimately, I'm looking for a solution that will be fastest and cleanest (in terms of code) in the long run. If I have to be stuck with using -> to address members of elements then I guess that's the way to go.
Does anyone have a solution that uses the least cpu power?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct{
long a;
char b;
}ab;
static char dat[500];
ab f2(int set){
ab* x=(ab*)dat;
if (set==2){return NULL;}
if (set==1){
x->a=1;
x->b='1';
x++;
x->a=2;
x->b='2';
x=(ab*)dat;
}
return x[0];
}
int main(){
ab y;
y=f2(1);
printf("%c",y.b);
y.b='D';
y=f2(0);
printf("%c",y.b);
return 0;
}
If you care about speed, it is implementation specific.
Notice that the Linux x86-64 ABI defines that a struct of two (exactly) scalar members (that is, integers, doubles, or pointers, -which all fit in a single machine register- but not struct etc... which are aggregate data) is returned thru two registers (without going thru the stack), and that is quite fast.
BTW
if (set==2){return NULL;} //wrong
is obviously wrong. You could code:
if (set==2) return (aa){0,0};
Also,
ab* x=(ab*)dat; // suspicious
looks suspicious to me (since you return x[0]; later). You are not guaranteed that dat is suitably aligned (e.g. to 8 or 16 bytes), and on some platforms (notably x86-64) if dat is misaligned you are at least losing performance (actually, it is undefined behavior).
BTW, I would suggest to always return with instructions like return (aa){l,c}; (where l is an expression convertible to long and c is an expression convertible to char); this is probably the easiest to read, and will be optimized to load the two return registers.
Of course if you care about performance, for benchmarking purposes, you should enable optimizations (and warnings), e.g. compile with gcc -Wall -Wextra -O2 -march=native if using GCC; on my system (Linux/x86-64 with GCC 5.2) the small function
ab give_both(long x, char y)
{ return (ab){x,y}; }
is compiled (with gcc -O2 -march=native -fverbose-asm -S) into:
.globl give_both
.type give_both, #function
give_both:
.LFB0:
.file 1 "ab.c"
.loc 1 7 0
.cfi_startproc
.LVL0:
.loc 1 7 0
xorl %edx, %edx # D.2139
movq %rdi, %rax # x, x
movb %sil, %dl # y, D.2139
ret
.cfi_endproc
you see that all the code is using registers, and no memory is used at all..
I would use the return value as an error code, and the caller passes in a pointer to his struct such as:
int f2(int set, ab *outAb); // returns -1 if invalid input and 0 otherwise

gcc, strict-aliasing, and horror stories [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
In gcc-strict-aliasing-and-casting-through-a-union I asked whether anyone had encountered problems with union punning through pointers. So far, the answer seems to be No.
This question is broader: Do you have any horror stories about gcc and strict-aliasing?
Background: Quoting from AndreyT's answer in c99-strict-aliasing-rules-in-c-gcc:
"Strict aliasing rules are rooted in parts of the standard that were present in C and C++ since the beginning of [standardized] times. The clause that prohibits accessing object of one type through a lvalue of another type is present in C89/90 (6.3) as well as in C++98 (3.10/15). ... It is just that not all compilers wanted (or dared) to enforce it or rely on it."
Well, gcc is now daring to do so, with its -fstrict-aliasing switch. And this has caused some problems. See, for example, the excellent article http://davmac.wordpress.com/2009/10/ about a Mysql bug, and the equally excellent discussion in http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html.
Some other less-relevant links:
performance-impact-of-fno-strict-aliasing
strict-aliasing
when-is-char-safe-for-strict-pointer-aliasing
how-to-detect-strict-aliasing-at-compile-time
So to repeat, do you have a horror story of your own? Problems not indicated by -Wstrict-aliasing would, of course, be preferred. And other C compilers are also welcome.
Added June 2nd: The first link in Michael Burr's answer, which does indeed qualify as a horror story, is perhaps a bit dated (from 2003). I did a quick test, but the problem has apparently gone away.
Source:
#include <string.h>
struct iw_event { /* dummy! */
int len;
};
char *iwe_stream_add_event(
char *stream, /* Stream of events */
char *ends, /* End of stream */
struct iw_event *iwe, /* Payload */
int event_len) /* Real size of payload */
{
/* Check if it's possible */
if ((stream + event_len) < ends) {
iwe->len = event_len;
memcpy(stream, (char *) iwe, event_len);
stream += event_len;
}
return stream;
}
The specific complaint is:
Some users have complained that when the [above] code is compiled without the -fno-strict-aliasing, the order of the write and memcpy is inverted (which means a bogus len is mem-copied into the stream).
Compiled code, using gcc 4.3.4 on CYGWIN wih -O3 (please correct me if I am wrong--my assembler is a bit rusty!):
_iwe_stream_add_event:
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $20, %esp
movl 8(%ebp), %eax # stream --> %eax
movl 20(%ebp), %edx # event_len --> %edx
leal (%eax,%edx), %ebx # sum --> %ebx
cmpl 12(%ebp), %ebx # compare sum with ends
jae L2
movl 16(%ebp), %ecx # iwe --> %ecx
movl %edx, (%ecx) # event_len --> iwe->len (!!)
movl %edx, 8(%esp) # event_len --> stack
movl %ecx, 4(%esp) # iwe --> stack
movl %eax, (%esp) # stream --> stack
call _memcpy
movl %ebx, %eax # sum --> retval
L2:
addl $20, %esp
popl %ebx
leave
ret
And for the second link in Michael's answer,
*(unsigned short *)&a = 4;
gcc will usually (always?) give a warning. But I believe a valid solution to this (for gcc) is to use:
#define CAST(type, x) (((union {typeof(x) src; type dst;}*)&(x))->dst)
// ...
CAST(unsigned short, a) = 4;
I've asked SO whether this is OK in gcc-strict-aliasing-and-casting-through-a-union, but so far nobody disagrees.
No horror story of my own, but here are some quotes from Linus Torvalds (sorry if these are already in one of the linked references in the question):
http://lkml.org/lkml/2003/2/26/158:
Date Wed, 26 Feb 2003 09:22:15 -0800
Subject Re: Invalid compilation without -fno-strict-aliasing
From Jean Tourrilhes <>
On Wed, Feb 26, 2003 at 04:38:10PM +0100, Horst von Brand wrote:
Jean Tourrilhes <> said:
It looks like a compiler bug to me...
Some users have complained that when the following code is
compiled without the -fno-strict-aliasing, the order of the write and
memcpy is inverted (which mean a bogus len is mem-copied into the
stream).
Code (from linux/include/net/iw_handler.h) :
static inline char *
iwe_stream_add_event(char * stream, /* Stream of events */
char * ends, /* End of stream */
struct iw_event *iwe, /* Payload */
int event_len) /* Real size of payload */
{
/* Check if it's possible */
if((stream + event_len) < ends) {
iwe->len = event_len;
memcpy(stream, (char *) iwe, event_len);
stream += event_len;
}
return stream;
}
IMHO, the compiler should have enough context to know that the
reordering is dangerous. Any suggestion to make this simple code more
bullet proof is welcomed.
The compiler is free to assume char *stream and struct iw_event *iwe point
to separate areas of memory, due to strict aliasing.
Which is true and which is not the problem I'm complaining about.
(Note with hindsight: this code is fine, but Linux's implementation of memcpy was a macro that cast to long * to copy in larger chunks. With a correctly-defined memcpy, gcc -fstrict-aliasing isn't allowed to break this code. But it means you need inline asm to define a kernel memcpy if your compiler doesn't know how turn a byte-copy loop into efficient asm, which was the case for gcc before gcc7)
And Linus Torvald's comment on the above:
Jean Tourrilhes wrote:
>
It looks like a compiler bug to me...
Why do you think the kernel uses "-fno-strict-aliasing"?
The gcc people are more interested in trying to find out what can be
allowed by the c99 specs than about making things actually work. The
aliasing code in particular is not even worth enabling, it's just not
possible to sanely tell gcc when some things can alias.
Some users have complained that when the following code is
compiled without the -fno-strict-aliasing, the order of the write and
memcpy is inverted (which mean a bogus len is mem-copied into the
stream).
The "problem" is that we inline the memcpy(), at which point gcc won't
care about the fact that it can alias, so they'll just re-order
everything and claim it's out own fault. Even though there is no sane
way for us to even tell gcc about it.
I tried to get a sane way a few years ago, and the gcc developers really
didn't care about the real world in this area. I'd be surprised if that
had changed, judging by the replies I have already seen.
I'm not going to bother to fight it.
Linus
http://www.mail-archive.com/linux-btrfs#vger.kernel.org/msg01647.html:
Type-based aliasing is stupid. It's so incredibly stupid that it's not even funny. It's broken. And gcc took the broken notion, and made it more so by making it a "by-the-letter-of-the-law" thing that makes no sense.
...
I know for a fact that gcc would re-order write accesses that were clearly to (statically) the same address. Gcc would suddenly think that
unsigned long a;
a = 5;
*(unsigned short *)&a = 4;
could be re-ordered to set it to 4 first (because clearly they don't alias - by reading the standard), and then because now the assignment of 'a=5' was later, the assignment of 4 could be elided entirely! And if somebody complains that the compiler is insane, the compiler people would say "nyaah, nyaah, the standards people said we can do this", with absolutely no introspection to ask whether it made any SENSE.
SWIG generates code that depends on strict aliasing being off, which can cause all sorts of problems.
SWIGEXPORT jlong JNICALL Java_com_mylibJNI_make_1mystruct_1_1SWIG_12(
JNIEnv *jenv, jclass jcls, jint jarg1, jint jarg2) {
jlong jresult = 0 ;
int arg1 ;
int arg2 ;
my_struct_t *result = 0 ;
(void)jenv;
(void)jcls;
arg1 = (int)jarg1;
arg2 = (int)jarg2;
result = (my_struct_t *)make_my_struct(arg1,arg2);
*(my_struct_t **)&jresult = result; /* <<<< horror*/
return jresult;
}
gcc, aliasing, and 2-D variable-length arrays: The following sample code copies a 2x2 matrix:
#include <stdio.h>
static void copy(int n, int a[][n], int b[][n]) {
int i, j;
for (i = 0; i < 2; i++) // 'n' not used in this example
for (j = 0; j < 2; j++) // 'n' hard-coded to 2 for simplicity
b[i][j] = a[i][j];
}
int main(int argc, char *argv[]) {
int a[2][2] = {{1, 2},{3, 4}};
int b[2][2];
copy(2, a, b);
printf("%d %d %d %d\n", b[0][0], b[0][1], b[1][0], b[1][1]);
return 0;
}
With gcc 4.1.2 on CentOS, I get:
$ gcc -O1 test.c && a.out
1 2 3 4
$ gcc -O2 test.c && a.out
10235717 -1075970308 -1075970456 11452404 (random)
I don't know whether this is generally known, and I don't know whether this a bug or a feature. I can't duplicate the problem with gcc 4.3.4 on Cygwin, so it may have been fixed. Some work-arounds:
Use __attribute__((noinline)) for copy().
Use the gcc switch -fno-strict-aliasing.
Change the third parameter of copy() from b[][n] to b[][2].
Don't use -O2 or -O3.
Further notes:
This is an answer, after a year and a day, to my own question (and I'm a bit surprised there are only two other answers).
I lost several hours with this on my actual code, a Kalman filter. Seemingly small changes would have drastic effects, perhaps because of changing gcc's automatic inlining (this is a guess; I'm still uncertain). But it probably doesn't qualify as a horror story.
Yes, I know you wouldn't write copy() like this. (And, as an aside, I was slightly surprised to see gcc did not unroll the double-loop.)
No gcc warning switches, include -Wstrict-aliasing=, did anything here.
1-D variable-length arrays seem to be OK.
Update: The above does not really answer the OP's question, since he (i.e. I) was asking about cases where strict aliasing 'legitimately' broke your code, whereas the above just seems to be a garden-variety compiler bug.
I reported it to GCC Bugzilla, but they weren't interested in the old 4.1.2, even though (I believe) it is the key to the $1-billion RHEL5. It doesn't occur in 4.2.4 up.
And I have a slightly simpler example of a similar bug, with only one matrix. The code:
static void zero(int n, int a[][n]) {
int i, j;
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
a[i][j] = 0;
}
int main(void) {
int a[2][2] = {{1, 2},{3, 4}};
zero(2, a);
printf("%d\n", a[1][1]);
return 0;
}
produces the results:
gcc -O1 test.c && a.out
0
gcc -O1 -fstrict-aliasing test.c && a.out
4
It seems it is the combination -fstrict-aliasing with -finline which causes the bug.
here is mine:
http://forum.openscad.org/CGAL-3-6-1-causing-errors-but-CGAL-3-6-0-OK-tt2050.html
it caused certain shapes in a CAD program to be drawn incorrectly. thank goodness for the project's leaders work on creating a regression test suite.
the bug only manifested itself on certain platforms, with older versions of GCC and older versions of certain libraries. and then only with -O2 turned on. -fno-strict-aliasing solved it.
The Common Initial Sequence rule of C used to be interpreted as making it
possible to write a function which could work on the leading portion of a
wide variety of structure types, provided they start with elements of matching
types. Under C99, the rule was changed so that it only applied if the structure
types involved were members of the same union whose complete declaration was visible at the point of use.
The authors of gcc insist that the language in question is only applicable if
the accesses are performed through the union type, notwithstanding the facts
that:
There would be no reason to specify that the complete declaration must be visible if accesses had to be performed through the union type.
Although the CIS rule was described in terms of unions, its primary
usefulness lay in what it implied about the way in which structs were
laid out and accessed. If S1 and S2 were structures that shared a CIS,
there would be no way that a function that accepted a pointer to an S1
and an S2 from an outside source could comply with C89's CIS rules
without allowing the same behavior to be useful with pointers to
structures that weren't actually inside a union object; specifying CIS
support for structures would thus have been redundant given that it was
already specified for unions.
The following code returns 10, under gcc 4.4.4. Is anything wrong with the union method or gcc 4.4.4?
int main()
{
int v = 10;
union vv {
int v;
short q;
} *s = (union vv *)&v;
s->v = 1;
return v;
}

Resources