Consider the two slightly different versions of the same code:
struct s
{
int dummy[1];
};
volatile struct s s;
int main(void)
{
s;
return 0;
}
and
struct s
{
int dummy[16];
};
volatile struct s s;
int main(void)
{
s;
return 0;
}
Here's what I'm getting with gcc 4.6.2 for them:
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call ___main
movl _s, %eax
xorl %eax, %eax
leave
ret
.comm _s, 4, 2
and
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call ___main
xorl %eax, %eax
leave
ret
.comm _s, 64, 5
Please note the absence of access to s in the second case.
Is this a compiler bug, or did the gcc developers simply choose a weird implementation-defined behavior that still plays by the rules, given the following statement in the C standard?
What constitutes an access to an object that has volatile-qualified type is implementation-defined.
What would be the reason for this difference? I'd naturally expect the whole structure to be accessed (or not accessed, I'm not sure), irrespective of its size and of what's inside it.
P.S. What does your compiler (non-gcc or newer gcc) do in this case? (please answer this last question in a comment if that's the only part you're going to address, as this isn't the main question being asked, but more of a curiosity question).
There is a difference between C and C++ for this question which explains what's going on.
clang-3.4
When compiling either of these snippets as C++, the emitted assembly didn't reference s in either case. In fact a warning was issued for both:
volatile.c:8:2: warning: expression result unused; assign into a variable to force a volatile load [-Wunused-volatile-lvalue]
s;
These warnings were not issued when compiling in C99 mode. As mentioned in this blog post and this GCC wiki entry from the question comments, using s in this context causes an lvalue-to-rvalue conversion in C, but not in C++. This is confirmed by examining the Clang AST for C, which contains an ImplicitCastExpr of kind LValueToRValue that does not exist in the AST generated from C++. (The AST is not affected by the size of the struct.)
A quick grep of the Clang source reveals this in the emission of aggregate expressions:
case CK_LValueToRValue:
// If we're loading from a volatile type, force the destination
// into existence.
if (E->getSubExpr()->getType().isVolatileQualified()) {
EnsureDest(E->getType());
return Visit(E->getSubExpr());
}
EnsureDest forces the emission of a stack slot, sized and typed for the expression. As the optimizers are not allowed to remove volatile accesses, they remain as a scalar load/store and a memcpy respectively in both the IR and output asm. This is the behavior I would expect, given the above.
gcc-4.8.2
Here, I observe the same behavior as in the question. However, when I change the expression from s; to s.dummy;, the access does not appear in either version. I'm not as familiar with the internals of gcc as I am with LLVM, so I can't speculate why this would happen. But based on the above observations, I would say this is a compiler bug due to the inconsistency.
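If the goal is only to make sure the object really is read, one option that sidesteps the question of what a bare s; means is to follow the advice in the clang warning and assign into a variable. A minimal sketch: the function names read_first_element and copy_whole_struct are made up for illustration, and whether the struct copy comes out as a memcpy or as individual loads is up to the compiler.

```c
struct s
{
    int dummy[16];
};

volatile struct s s;

/* Reads one element through the volatile lvalue. The value is used,
   so the load is not an unused expression result. */
int read_first_element(void)
{
    int v = s.dummy[0];
    return v;
}

/* Copies the whole object out of the volatile struct. This is valid C;
   the lvalue conversion drops the volatile qualifier from the value. */
struct s copy_whole_struct(void)
{
    struct s local = s;
    return local;
}
```

Compiling this with -S for both dummy[1] and dummy[16] is a quick way to compare how a given compiler treats the two sizes.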
Related
In this golfing answer I saw a trick where the return value is the second parameter which is not passed in.
int f(i, j)
{
j = i;
}
int main()
{
return f(3);
}
From gcc's assembly output it looks like when the code copies j = i it stores the result in eax which happens to be the return value.
f:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
nop
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
movl $3, %edi
movl $0, %eax
call f
popq %rbp
ret
So, did this happen just by being lucky? Is this documented by gcc? It only works with -O0, but it works with a bunch of values of i I tried, -m32, and a bunch of different versions of GCC.
gcc -O0 likes to evaluate expressions in the return-value register, if a register is needed at all. (GCC -O0 generally just likes to have values in the retval register, but this goes beyond picking that as the first temporary.)
I've tested a bit, and it really looks like GCC -O0 does this on purpose across multiple ISAs, sometimes even using an extra mov instruction or equivalent. IIRC I made an expression more complicated so the result of evaluation ended up in another register, but it still copied it back to the retval register.
Things like x++ that can (on x86) compile to a memory-destination inc or add won't leave the value in a register, but assignments typically will. So it's not quite like GCC is treating function bodies like GNU C statement-expressions.
This is not documented, guaranteed, or standardized by anything. It's an implementation detail, not something intended for you to take advantage of like this.
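For comparison, here is a sketch of the same function written so that it does not depend on that implementation detail. It keeps the name f from the question but drops the unused second parameter, which is my change, not part of the golfed answer.

```c
/* Portable version of the golfed function: the value is handed back
   with an explicit return, so it no longer depends on which register
   gcc -O0 happens to use while evaluating the assignment. */
int f(int i)
{
    int j = i;
    return j;
}
```

This behaves the same at any optimization level and on any conforming compiler, at the cost of a few extra characters.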
"Returning" a value this way means you're programming in "GCC -O0", not C. The wording of the code-golf rules says that programs have to work on at least one implementation. But my reading of that is that they should work for the right reasons, not because of some side-effect implementation detail. They break on clang not because clang doesn't support some language feature, just because they're not even written in C.
Breaking with optimization enabled is also not cool; some level of UB is generally acceptable in code golf, like integer wraparound or pointer-casting type punning being things that one might reasonably wish were well-defined. But this is pure abuse of an implementation detail of one compiler, not a language feature.
I argued this point in comments under the relevant answer on the Codegolf.SE C golfing tips Q&A (which incorrectly claims it works beyond GCC). That answer has 4 downvotes (and deserves more IMO), but 16 upvotes. So some members of the community disagree that this is terrible and silly.
Fun fact: in ISO C++ (but not C), having execution fall off the end of a non-void function is Undefined Behaviour, even if the caller doesn't use the result. This is true even in GNU C++; outside of -O0 GCC and clang will sometimes emit code like ud2 (illegal instruction) for a path of execution that reaches the end of a function without a return. So GCC doesn't in general define the behaviour here (which implementations are allowed to do for things that ISO C and C++ leaves undefined. e.g. gcc -fwrapv defines signed overflow as 2's complement wraparound.)
But in ISO C, it's legal to fall off the end of a non-void function: it only becomes UB if the caller uses the return value. Without -Wall GCC may not even warn. Checking return value of a function without return statement
With optimization disabled, function inlining won't happen so the UB isn't really compile-time visible. (Unless you use __attribute__((always_inline))).
Passing a 2nd arg merely gives you something to assign to. It's not important that it's a function arg. But i=i; optimizes away even with -O0 so you do need a separate variable. Also just i; optimizes away.
Fun fact: a recursive f(i){ f(i); } function body does bounce i through EAX before copying it to the first arg-passing register. So GCC just really loves EAX.
movl -4(%rbp), %eax
movl %eax, %edi
movl $0, %eax # without a full prototype, pass # of FP args in AL
call f
i++; doesn't load into EAX; it just uses a memory-destination add without loading into a register. Worth trying with gcc -O0 for ARM.
In C, what is the assembly for "b++"?
I got two situations:
1) one instruction
addl $0x1,-4(%rbp)
2) three instructions
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
Are these two situations caused by the compiler?
my code:
int main()
{
int ret = 0;
int i = 2;
ret = i++;
ret = ++i;
return ret;
}
the .s file (++i uses the addl instruction, i++ uses the other sequence):
.file "main.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, -8(%rbp) //ret
movl $2, -4(%rbp) //i
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
movl %eax, -8(%rbp)
addl $1, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413"
.section .note.GNU-stack,"",@progbits
The ISO standard does not mandate at all what happens under the covers. It specifies a "virtual machine" that acts in a certain way given the C instructions you provide to it.
So, if your C compiler is implemented as a C-to-Dartmouth-Basic converter, b++ is just as likely to lead to 10 let b = b + 1 as anything else :-)
If you're compiling to common assembler code, then you're likely to see a difference depending on whether you use the result, specifically b++; as opposed to a = b++ since the result of the former can be safely thrown away.
You're also likely to see massive differences based on optimisation level.
Bottom line: short of specifying all the things that can affect the output (including but not limited to compiler, target platform, and optimisation levels), there is no single answer to what b++ compiles to.
The first one is the output for ++i as part of ret = ++i. It doesn't need to keep the old value around, because it's doing ++i and then ret = i. Incrementing in memory and then reloading that is a really stupid and inefficient way to compile that, but you compiled with optimization disabled so gcc isn't even trying to make good asm output.
The 2nd one is the output for i++ as part of ret = i++. It needs to keep the old value of i around, so it loads into a register and uses lea to calculate i+1 in a different register. It could have just stored to ret and then incremented the register before storing back to i, but I guess with optimizations disabled gcc doesn't notice that.
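To look at the two cases in isolation, it can help to give each one its own function and compile with gcc -S. A small sketch; the function names are invented for this example:

```c
/* Returns the OLD value of *p and then increments it, so the old
   value has to be kept somewhere while the new one is stored. */
int post_increment(int *p)
{
    return (*p)++;
}

/* Increments *p first and returns the NEW value, so nothing from
   before the increment needs to be kept. */
int pre_increment(int *p)
{
    return ++(*p);
}
```

Starting from i = 2 as in the question, post_increment yields 2 and leaves i at 3, then pre_increment yields 4 and leaves i at 4, which matches the final ret.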
Previous answer to the previous vague question without source, and with bogus code:
The asm for a tiny expression like b++ totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2 uses inc for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem) for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; it looks like the original was mis-typed by hand, mixing an opcode from one line with an operand from another line. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on surrounding code and having optimization disabled. Using ret = b++ needs the old value of b, not the incremented value, hence different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious what compiler you got that from, because gcc and clang typically don't like to use results they just computed. I'd have expected incl -4(%rbp).
Also that doesn't explain mov %eax, -4(%rbp). The compiler already used the address in %rax for inc, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b; isn't valid because b++ isn't an lvalue. Well actually the comma operator lets you do b++, b = &b; in one statement, but gcc -O0 still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b would have to be volatile to explain incrementing in memory right before overwriting it.
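A sketch of that volatile case, for trying at -O2. It uses a plain int store instead of the pointer cast from the weird source above, and the function name is made up:

```c
volatile int b;

/* Accesses to a volatile object are side effects, so the compiler has
   to keep both the read-modify-write for b++ and the store that
   overwrites it immediately afterwards. */
void bump_then_overwrite(int value)
{
    b++;
    b = value;
}
```

Dropping the volatile qualifier and recompiling with optimization should show the increment disappearing, since its result is dead.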
clang is similar, but actually does compute that address early. For b++; b = &b;, notice that clang6.0 -O0 does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump the way gcc does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)
Any good compiler will optimise b++ to ++b if the result of the expression is discarded. You see this particularly in increments in for loops.
That's what is happening in your "one instruction" case.
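The for-loop case can be sketched like this. The two functions differ only in k++ versus ++k, and since the value of that expression is discarded they compute the same thing; the function names are just for illustration.

```c
/* Loop counter advanced with post-increment; the result of k++ is
   never used, only its side effect. */
int sum_post(const int *v, int n)
{
    int total = 0;
    for (int k = 0; k < n; k++)
        total += v[k];
    return total;
}

/* Same loop with pre-increment. */
int sum_pre(const int *v, int n)
{
    int total = 0;
    for (int k = 0; k < n; ++k)
        total += v[k];
    return total;
}
```

Comparing the -O2 output for the two functions is an easy way to check whether your compiler treats them identically.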
It's not typically instructive to look at un-optimized compiler output, since values (variables) will usually be updated using a load-modify-store paradigm. This might be useful initially when getting to grips with assembly, but it's not the output to expect from an optimizing compiler that maintains values, pointers, etc., in registers for frequent use. (see: locality of reference)
/* un-optimized logic: */
int i = 2;
ret = i++; /* assign ret <- i, and post-increment i (ret = i; i++ (i = 3)) */
ret = ++i; /* pre-increment i, and assign ret <- i (++i (i = 4); ret = i) */
i.e., any modern, optimising compiler can easily determine that the final value of ret is (4).
Removing all the extraneous directives, etc., gcc-7.3.0 on OS X gives me:
_main: /* Darwin x86-64 ABI adds leading underscores to symbols... */
movl $4, %eax
ret
Apple's native clang, and the MacPorts clang-6.0 set up basic stack frame, but still optimise the ret arithmetic away:
_main:
pushq %rbp
movq %rsp, %rbp
movl $4, %eax
popq %rbp
retq
Note that the Mach-O (OS X) ABI is very similar to the ELF ABI for user-space code. Just try compiling with at least -O2 to get a feel for 'real' (production) code.
This is more or less a request for clarification on
Casting a function pointer to another type with example code
struct my_struct;
void my_callback_function(struct my_struct* arg);
void do_stuff(void (*cb)(void*));
static void my_callback_helper(void* pv)
{
my_callback_function(pv);
}
int main()
{
do_stuff(&my_callback_helper);
}
The answer says a "good" compiler should be able to optimize out the my_callback_helper() function, but I found no compiler at https://gcc.godbolt.org that does it, and the helper function always gets generated, even if it's just a jump to my_callback_function() (-O3):
my_callback_helper:
jmp my_callback_function
main:
subq $8, %rsp
movl $my_callback_helper, %edi
call do_stuff
xorl %eax, %eax
addq $8, %rsp
ret
So my question is: Is there anything in the standard that prevents compilers from eliminating the helper?
There's nothing in the standard that directly prevents this optimization. But in practice, it's not always possible for compilers when they don't have a "full picture".
You have taken the address of my_callback_helper, so the compiler can't easily optimize it out because it doesn't know what do_stuff does with it. In the separate module where do_stuff is defined, the compiler doesn't know that it can simply call my_callback_function in place of its argument (my_callback_helper). In order to optimize out my_callback_helper completely, the compiler has to know what do_stuff does as well. But do_stuff is an external function whose definition isn't available to the compiler. So this sort of optimization may happen if you provide a definition for do_stuff and all its uses.
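Here is a sketch of the "full picture" situation, with everything in one translation unit and internal linkage. The body of do_stuff, its extra data parameter, the struct member and the run wrapper are all assumptions added for this example, since the question only declares do_stuff.

```c
struct my_struct
{
    int value;
};

static int last_seen; /* records what the callback received */

static void my_callback_function(struct my_struct *arg)
{
    last_seen = arg->value;
}

static void my_callback_helper(void *pv)
{
    my_callback_function(pv);
}

/* Hypothetical definition: visible to the compiler, and static so no
   other module can call it with a different callback. */
static void do_stuff(void (*cb)(void *), void *data)
{
    cb(data);
}

int run(int v)
{
    struct my_struct m = { v };
    do_stuff(my_callback_helper, &m);
    return last_seen;
}
```

With this layout an optimizing compiler is free to inline the whole chain into run and drop the helper, though nothing obliges it to.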
I've written this simple C code
int main()
{
int calc = 2+2;
return 0;
}
And I want to see how that looks in assembly, so I compiled it using gcc
$ gcc -S -o asm.s test.c
And the result was ~65 lines (Mac OS X 10.8.3) and I only found these to be related:
Where do I look for my 2+2 in this code?
Edit:
One part of the question hasn't been addressed.
If %rbp, %rsp, %eax are variables, what values do they attain in this case?
Almost all of the code you got is just useless stack manipulation. With optimization on (gcc -S -O2 test.c) you will get something like
main:
.LFB0:
.cfi_startproc
xorl %eax, %eax
ret
.cfi_endproc
.LFE0:
Ignore every line that starts with a dot or ends with a colon: there are only two assembly instructions:
xorl %eax, %eax
ret
and they encode return 0;. (XORing a register with itself sets it to all-bits-zero. Function return values go in register %eax per the x86 ABI.) Everything to do with your int calc = 2+2; has been discarded as unused.
If you changed your code to
int main(void) { return 2+2; }
you would instead get
movl $4, %eax
ret
where the 4 comes from the compiler doing the addition itself rather than making the generated program do it (this is called constant folding).
Perhaps more interesting is if you change the code to
int main(int argc, char **argv) { return argc + 2; }
then you get
leal 2(%rdi), %eax
ret
which is doing some real work at runtime! In the 64-bit ELF ABI, %rdi holds the first argument to the function, argc in this case. leal 2(%rdi), %eax is x86 assembly language for "%eax = %edi + 2" and it's being done this way mainly because the more familiar add instruction takes only two arguments, so you can't use it to add 2 to %rdi and put the result in %eax all in one instruction. (Ignore the difference between %rdi and %edi for now.)
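The same lea trick shows up for small multiplications, because the addressing mode can also scale a register by 2, 4 or 8. A sketch to try with gcc -O2 -S; the function names are made up, and the exact instructions chosen depend on the compiler and tuning options:

```c
/* x + 2: a candidate for a single lea with a displacement. */
int add_two(int x)
{
    return x + 2;
}

/* x * 5 can be written as x + x*4, which fits the base + index*scale
   form of an x86 address, so compilers commonly use lea here too. */
int times_five(int x)
{
    return x * 5;
}
```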
The compiler determined that 2+2 = 4 and inlined it. The constant is stored in line 10 (the $4). To verify this, change the math to 2+3 and you will see $5
EDIT: as for the registers themselves, %rsp is the stack pointer, %rbp is the frame pointer, and %eax is a general register
Here is an explanation of the assembly code:
pushq %rbp
This saves a copy of the frame pointer on the stack. The function itself does not need this; it is there so that debuggers or exception handlers can find frames on the stack.
movq %rsp, %rbp
This starts a new frame by setting the frame pointer to point to the current top-of-stack. Again, the function does not need this; it is housekeeping to maintain a proper stack.
movl $4, -12(%rbp)
Here the compiler initializes calc to 4. Several things have happened here. First, the compiler evaluated 2+2 by itself and used the result, 4, in the assembly code. The arithmetic is not performed in the executing program; it was completed in the compiler. Second, calc has been assigned the location 12 bytes below the frame pointer. (This is interesting because it is also below the stack pointer. The OS X ABI for this architecture includes a “red zone” below the stack pointer that programs are permitted to use, which is unusual.) Third, the program was clearly compiled without optimization. We know that because the optimizer would recognize that this code has no effect and is useless, so it would remove it.
movl $0, -8(%rbp)
This code stores 0 in the place the compiler has set aside to prepare the return value of main.
movl -8(%rbp), %eax
movl %eax, -4(%rbp)
This copies data from the place where the return value is prepared to a temporary handling location. This is even more useless than the previous code, reinforcing the conclusion that optimization was not used. This looks like code I would expect at a negative optimization level.
movl -4(%rbp), %eax
This moves the return value from the temporary handling location to the register in which it is returned to the caller.
popq %rbp
This restores the frame pointer, thus removing the previously-pushed frame from the stack.
ret
This puts the program out of its misery.
Your program has no observable behavior, which means that in the general case the compiler might not generate any machine code for it at all, besides some minimal startup and wrap-up instructions intended to ensure that zero is returned to the calling environment. At least declare your variable as volatile. Or print its value after evaluating it. Or return it from main.
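Two of those suggestions, sketched as standalone functions so they can be compiled with -O2 -S and compared (the function names are invented here):

```c
/* volatile forces the store to calc and the later load to be emitted,
   even with optimization on. The 2 + 2 itself is still folded to 4. */
int calc_volatile(void)
{
    volatile int calc = 2 + 2;
    return calc;
}

/* Returning the value makes it observable, so it cannot be discarded,
   but the compiler may still reduce the body to "return 4". */
int calc_returned(void)
{
    int calc = 2 + 2;
    return calc;
}
```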
Also note that in C, 2 + 2 qualifies as an integer constant expression. This means that the compiler is not just allowed, but actually required to know the result of that expression at compile time. Taking this into account, it would be strange to expect the compiler to evaluate 2 + 2 at run time when the final value is known at compile time (even if you completely disable optimizations).
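A few places where C requires an integer constant expression, and therefore requires the compiler to work out 2 + 2 by itself. This is only an illustration; the names FOUR, table, classify and table_length are made up.

```c
/* Enumeration constants must be integer constant expressions. */
enum { FOUR = 2 + 2 };

/* So must the size of an array declared at file scope. */
int table[2 + 2];

/* And so must case labels. */
int classify(int x)
{
    switch (x) {
    case 2 + 2:
        return 1;
    default:
        return 0;
    }
}

/* Number of elements in table, computed from its type. */
int table_length(void)
{
    return (int)(sizeof table / sizeof table[0]);
}
```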
The compiler optimized it away: it pre-computed the answer and just set the result. If you want to see the compiler do the add, then you cannot let it "see" the constants you are feeding it.
If you compile this code all by itself as an object (gcc -O2 -c test_add.c -o test_add.o)
then you will force the compiler to generate the add code. But the operands will be registers or on the stack.
int test_add ( int a, int b )
{
return(a+b);
}
Then if you call it from code in a separate source (gcc -O2 -c test.c -o test.o) then you will see the two operands be forced into the function.
extern int test_add ( int, int );
int test ( void )
{
return(test_add(2,2));
}
and you can disassemble both of those objects (objdump -D test.o, objdump -D test_add.o)
When you do something that simple in one file
int main ( void )
{
int a,b,c;
a=2;
b=2;
c=a+b;
return(0);
}
The compiler can optimize your code into one of a few equivalents. My example here does nothing: the math and results have no purpose and are not used, so they can simply be removed as dead code. Your optimization did this
int main ( void )
{
int c;
c=4;
return(0);
}
But this is also a perfectly valid optimization of the above code
int main ( void )
{
return(0);
}
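If you want to stay in a single file, another way to keep the compiler from "seeing" the constants is to read the operands through volatile objects. This is a sketch of that idea, not something from the code above, and the function name is made up:

```c
/* Each volatile read has to be performed at run time, so the compiler
   cannot assume a and b still hold 2 and must emit a real addition. */
int add_at_runtime(void)
{
    volatile int a = 2;
    volatile int b = 2;
    return a + b;
}
```

The operands will typically be loaded from the stack into registers first, so expect loads followed by an add rather than a single instruction.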
EDIT:
Where is the calc=2+2?
I believe the
movl $4,-12(%rbp)
is the 2+2 (the answer is computed and simply placed in calc, which is on the stack).
movl $0,-8(%rbp)
I assume is the 0 in your return(0);
The actual math of adding two numbers was optimized out.
I guess it's line 10; the compiler optimized it since all the operands are constants.
In C, I have this piece of code:
int a;
a = 10 + 5 - 3;
I want to ask: where is (10+5-3) stored at?
(As far as I know, a is located on stack, how about (10+5-3)? How does this rvalue get calculated?)
Typically, the r-value is "stored" within the program itself.
In other words, the compiler itself (before the program is ever run) computes the 10 + 5 - 3 value (it can do so since it is all based on constant immediate values), and it emits the assembly code to store the result of this calculation in whatever l-value is the target of the assignment (in this case, the variable named a, which the compiler probably knows as a relative address to a data segment origin of sorts).
The r-value, which has a value of 12, is therefore only found inside the binary of the program, within an assembly instruction that looks like
mov <some dest, typically DS-relative>, $0C
$0C is the "r-value".
If the r-value happened to be the result of a calculation that can only be done at run-time, say if the underlying C code was: a = 17 * x; // x some run time var, the r-value would too be "stored" (or rather materialized) as a series of instructions within the program binary. The difference with the simple "mov dest, imm" above is that it would take several instructions to load the variable x in an accumulator, multiply by 17 and store the result at the address where the variable a is. It is possible that the compiler may "authorize itself" ;-) to use the stack for some intermediate result etc. but such would be
a) completely compiler dependent
b) transient
c) and typically would only involve part of the r-value
it is therefore safe to say that the r-value is a compile-time concept which is encapsulated in parts of the program (not the data), and isn't stored anywhere but in the program binary.
In response to paxdiablo: the explanation offered above is indeed restrictive of the possibilities because the C standard effectively does not dictate anything of that nature. Nevertheless, most any r-value is eventually materialized, at least in part, by some instructions which set things up so that the proper value, whether calculated (at run time) or immediate, gets addressed properly.
Constants are probably simplified at compile time, so your question as literally posed may not help. But something like, say, i - j + k that does need to be computed at runtime from some variables, may be "stored" wherever the compiler likes, depending on the CPU architecture: the compiler will typically try to do its best to use registers, e.g.
LOAD AX, i
SUB AX, j
ADD AX, k
to compute such an expression "storing" it in the accumulator register AX, before assigning it to some memory location with STORE AX, dest or the like. I'd be pretty surprised if a modern optimizing compiler on an even semi-decent CPU architecture (yeah, x86 included!-) needed to spill registers to memory for any reasonably simple expression!
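To see where such run-time r-values actually end up on your own toolchain, the two expressions mentioned in these answers can be wrapped in functions and compiled with -O2 -S. A small sketch with invented function names:

```c
/* 17 * x depends on a run-time value, so it is materialized as
   instructions; the intermediate result normally lives in a register. */
int scale(int x)
{
    return 17 * x;
}

/* i - j + k: a simple expression the compiler can usually evaluate
   entirely in registers without spilling to memory. */
int combine(int i, int j, int k)
{
    return i - j + k;
}
```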
This is compiler dependent. Usually the value (12) will be calculated by the compiler. It is then stored in the code, typically as part of a load/move immediate assembly instruction.
The result of the computation on the RHS (right-hand side) is computed by the compiler in a step that's called "constant folding".
Then, it is stored as an operand of the assembly instruction moving the value into a.
Here's a disassembly from MSVC:
int a;
a = 10 + 5 - 3;
0041338E mov dword ptr [a],0Ch
Where it stores it is actually totally up to the compiler. The standard does not dictate this behavior.
A typical place can be seen by actually compiling the code and looking at the assembler output:
int main (int argc, char *argv[]) {
int a;
a = 10 + 5 - 3;
return 0;
}
which produces:
.file "qq.c"
.def ___main;
.scl 2;
.type 32;
.endef
.text
.globl _main
.def _main;
.scl 2;
.type 32;
.endef
_main:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
movl $12, -4(%ebp) ;*****
movl $0, %eax
leave
ret
The relevant bit is marked ;***** and you can see that the value is created by the compiler and just inserted directly into a mov type instruction.
Note that it's only this simple because the expression is a constant value. As soon as you introduce non-constant values (like variables), the code becomes a little more complicated. That's because you have to look those variables up in memory (or they may already be in a register) and then manipulate the values at run-time, not compile-time.
As to how the compiler calculates what the value should be, that's to do with expression evaluation and is a whole other question :-)
Your question is based on an incorrect premise.
The defining property of lvalue in C is that it has a place in storage, i.e it is stored. This is what differentiates lvalue from rvalue. Rvalue is not stored anywhere. That's what makes it an rvalue. If it were stored, it would be lvalue by definition.
The terms "lvalue" and "rvalue" are used to bisect the world of expressions. That is, (10+5-3) is an expression that happens to be an rvalue (because you cannot apply the & operator to it -- in C++ the rules are more complicated). At runtime, there are no expressions, lvalues or rvalues. In particular, they aren't stored anywhere.
You were wondering where the value 12 was stored, but the value 12 is neither an lvalue nor an rvalue (as opposed to the expression 12 which would be an rvalue, but 12 does not appear in your program).
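The lvalue/rvalue distinction from the last few answers can be sketched in code. The function name store_and_locate is made up for this example:

```c
int a;

int *store_and_locate(void)
{
    /* 10 + 5 - 3 is an rvalue: it has a value but no storage that the
       program can refer to. */
    a = 10 + 5 - 3;

    /* Writing &(10 + 5 - 3) here would be rejected by the compiler,
       because & needs an lvalue operand. */

    /* a is an lvalue: it designates an object, so & applies to it. */
    return &a;
}
```

After the call, the object a holds 12, but that is the stored value of a, not the expression 10 + 5 - 3 itself.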