gcc optimized out unused variable when it should not


Consider the following code, which comes mostly from the Bluedroid stack:
#include <stdint.h>
#include <assert.h>
#define STREAM_TO_UINT16(u16, p) {u16 = ((uint16_t)(*(p)) + (((uint16_t)(*((p) + 1))) << 8)); (p) += 9;}
void func(uint8_t *param) {
uint8_t *stream = param;
uint16_t handle, handle2;
*stream = 5;
STREAM_TO_UINT16(handle, stream);
STREAM_TO_UINT16(handle2, stream);
assert(handle);
assert(handle2);
*stream = 7;
}
.file "opt.c"
.text
.align 4
.global func
.type func, @function
func:
entry sp, 32
movi.n a8, 5
s8i a8, a2, 0
movi.n a8, 7
s8i a8, a2, 18
retw.n
.size func, .-func
.ident "GCC: (crosstool-NG esp-2020r3) 8.4.0"
When it is compiled with NDEBUG, assert() resolves to nothing and "handle" is optimized out at -O2 (and likewise at -Os or -O3). As a result, the macro appears not to be expanded and the pointer is not incremented.
I know that I can make "handle" volatile as one option to solve the issue and I agree adding variable modification in macros is dangerous, but this is not my code, this is Bluedroid.
Well first, is this borderline a gcc bug and then is there a way to tell gcc to not optimize out unused variable?

Oops ... no, I just re-read the Xtensa ISA and I was wrong: the value of a8 is stored where a2 points, plus an offset, so this is correct. I need to look somewhere else, because the core of the problem is that as soon as I set NDEBUG, my Bluedroid stack (this is on ESP32) stops working, so I was searching for differences and looking where the compiler was whining (unused variables). Thanks for taking the time to answer.
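For anyone landing here with the same suspicion, here is a minimal sketch of why the pointer still advances even when the decoded value is never used. The function name stream_advance is made up for illustration. The macro is the one from the question, and the (void) casts stand in for assert() compiling to nothing under NDEBUG.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STREAM_TO_UINT16(u16, p) {u16 = ((uint16_t)(*(p)) + (((uint16_t)(*((p) + 1))) << 8)); (p) += 9;}

/* Hypothetical helper: reports how far two macro uses moved the pointer. */
ptrdiff_t stream_advance(uint8_t *param)
{
    uint8_t *stream = param;
    uint16_t handle, handle2;

    STREAM_TO_UINT16(handle, stream);
    STREAM_TO_UINT16(handle2, stream);

    /* The decoded values are deliberately unused, as with assert() under NDEBUG. */
    (void)handle;
    (void)handle2;

    /* The pointer arithmetic is a side effect the compiler has to keep. */
    return stream - param;
}
```

Each macro use adds 9 to the pointer, which matches the offset 18 in the final s8i instruction of the listing above: the compiler may drop the unused loads, but not the pointer update.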

Related

RISC-V inline assembly

I'm quite new to inline assembly, so I need your help to be sure that I use it correctly.
I need to add assembly code inside my C code that is compiled with the RISC-V toolchain. Please consider the following code:
int bar = 0xFF00;
int main(){
volatile int result;
int k;
k = funct();
int* ptr;
ptr = &bar;
asm volatile (".insn r 0x33, 0, 0, a4, a5, a3":
"=m"(*ptr), "=r"(result):
[a5] "m"(*ptr), [a3] "r"(k) :
);
}
...
What I want to do is bar = bar+k. Actually, I want to change the content of the memory location that bar resides in. But the code that I wrote gets the address of bar and adds it to k. Does anybody know what the problem is?

Unfortunately, you have misunderstood the syntax.
In the assembler string, you can refer to an operand either as %0, %1, and so on, where the number is the position of the operand passed to the asm directive (counting from zero), or by its symbolic name, %[myname], which refers to the operand declared as [myname]"r"(k).
Note that the symbolic name is equivalent to the number; the name itself doesn't imply anything. In your example, one could get the impression that you are forcing the code to use a specific processor register. (There is another syntax for that, if you really need to use it.)
For example, if you write something like:
int bar = 0xFF00;
int main(){
volatile int result;
int k;
k = funct();
int* ptr;
ptr = &bar;
asm volatile (".insn r 0x33, 0, 0, %[res], %[res], %[ptr]":
[res]"+r"(result) : [ptr]"r"(ptr));
}
The IAR compiler will emit the following. As you can see a0 has been assigned the result variable (using the symbolic name res) and a1 assigned the variable ptr (here, the symbolic name is the same as the variable name).
\ 000014 0001'2503 lw a0, 0x0(sp)
\ 000018 0000'05B7 lui a1, %hi(bar)
\ 00001C 0005'8593 addi a1, a1, %lo(bar)
\ 000020 00B5'0533 .insn r 0x33, 0, 0, a0, a0, a1
\ 000024 00A1'2023 sw a0, 0x0(sp)
You can read more about the IAR inline assembly syntax in the book "IAR C/C++ Development Guide Compiling and linking for RISC-V", in chapter "Assembler Language Interface". The book is provided as a PDF, which you can access from within IAR Embedded Workbench.

Based on the snippet provided in your question, I tried the following code with the IAR C/C++ Compiler for RISC-V:
int funct();
int funct() { return 0xA5; } // stub
int bar = 0xFF00;
int main() {
int k = funct();
int* ptr = &bar;
asm volatile (".insn r 0x33, 0, 0, %[res], %[ptr], %[k]"
: [res]"=r"(*ptr)
: [ptr]"r"(*ptr), [k]"r"(k));
}
In this case, the .insn directive will generate add r,r,r which is effectively *ptr = *ptr + k.
In an earlier version of this answer it was assumed that there would be a requirement to be explicit about which registers to use. For that, explicit register selectors were used as the IAR compiler simply allows it (e.g., "a3", "=a3", "a4", "a5", etc.). At that point, as noted by @PeterCordes in the comments, GCC offered a different set of constraints and would require a different solution. However, if there is no need to be explicit about the registers, it is better to let the compiler decide which ones can be used directly. It will generally impose less overhead.
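For readers using GCC or Clang rather than IAR, a rough sketch of the same idea with named operands might look like the following. The helper name add_to and the plain-C fallback for non-RISC-V targets are my own additions, not part of either answer. The value of *ptr is passed in as an input and the result is written back in C, so the compiler chooses all three registers.

```c
#include <assert.h>

int bar = 0xFF00;

/* Hypothetical helper: *ptr = *ptr + k, using named asm operands. */
void add_to(int *ptr, int k)
{
#if defined(__riscv)
    int res;
    /* Same encoding as "add rd, rs1, rs2"; registers are picked by the compiler. */
    __asm__ volatile (".insn r 0x33, 0, 0, %[res], %[val], %[k]"
                      : [res] "=r" (res)
                      : [val] "r" (*ptr), [k] "r" (k));
    *ptr = res;
#else
    /* Fallback so the sketch still builds on other targets. */
    *ptr = *ptr + k;
#endif
}
```

Calling add_to(&bar, k) then performs bar = bar + k, which is what the question asked for.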

How to specify default global variable alignment for gcc?

How do I get rid of alignment (.align 4 below) for all global variables by default with GCC, without having to specify __attribute__((aligned(1))) for each variable?
I know that what I ask for is a bad idea to apply universally, because on some architectures an alignment of 1 wouldn't work, e.g. because the CPU is not able to dereference an unaligned pointer. But in my case I'm writing an i386 bootloader, and unaligned pointers are fine (but slower) there.
Source code (a.c):
__attribute__((aligned(1))) int answer0 = 41;
int answer = 42;
Compiled with: gcc -m32 -Os -S a.c
Assembly output (a.s):
.file "a.c"
.globl answer
.data
.align 4
.type answer, @object
.size answer, 4
answer:
.long 42
.globl answer0
.type answer0, @object
.size answer0, 4
answer0:
.long 41
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
.section .note.GNU-stack,"",@progbits
The flag gcc -fpack-struct=1 changes the alignment of all struct members and structs to 1. For example, with that flag
struct x { char a; int b; };
struct y { int v : sizeof(char) + sizeof(int) == sizeof(struct x); };
struct z { int b; };
struct x x = { 1, 1 };
int i = 42;
struct z z = { 2 };
compiles to no alignment for variables x and z, but it still has an .align 4 for the variable i (of type int). I need a solution which also makes int i = 42; unaligned, without having to specify something extra for each such variable.

IMO packing variables to save the space using the packed struct is the easiest and safest way.
example:
#include <stdio.h>
#include <stdint.h>
#define _packed __attribute__((packed))
_packed struct
{
uint8_t x1;
_packed int x2;
_packed uint8_t x3[2];
_packed int x4;
}byte_int;
int main(void) {
printf("%p %p %p %p\n", &byte_int.x1, &byte_int.x2, &byte_int.x3, &byte_int.x4);
printf("%u %u %u %u\n", (unsigned int)&byte_int.x1, (unsigned int)&byte_int.x2, (unsigned int)&byte_int.x3, (unsigned int)&byte_int.x4); // I know it is an UB just to show the op in dec - easier to spot the odd and the even addresses
return 0;
}
https://ideone.com/bY1soH

Most probably gcc doesn't have such a flag which can change the default alignment of global variables.
gcc -fpack-struct=1 can be a workaround, but only for global variables which happen to be of struct type.
Also post-processing the .s output of gcc and removing (some of) the .align lines could work as a workaround.
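To illustrate the struct workaround for a single int, here is a small sketch. The type name packed_int and the two helper functions are made up for this example. It relies on the GCC/Clang packed attribute, and it does not give you a global default, since each variable still has to use the wrapper type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper type: a packed struct around a single int. */
struct __attribute__((packed)) packed_int {
    int value;
};

/* A global of the wrapper type; with GCC its alignment requirement is 1. */
struct packed_int answer = { 42 };

/* Helpers that report the alignment requirements for comparison. */
size_t packed_int_alignment(void) { return _Alignof(struct packed_int); }
size_t plain_int_alignment(void)  { return _Alignof(int); }
```

With GCC, packed_int_alignment() returns 1, while plain_int_alignment() is typically 4 on i386.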

Does struct with a single member have the same performance as a member type?

Does struct with a single member have the same performance as a member type (memory usage and speed)?
Example:
This code is a struct with a single member:
struct my_int
{
int value;
};
is the performance of my_int same as int ?

Agree with @harper overall, but watch out for the following:
A classic difference is seen with an "unstructured" array and an array in a structure.
char s1[1000];
// vs
typedef struct {
char s2[1000];
} s_T;
s_T s3;
When calling functions ...
void f1(char s[1000]);
void f2(s_T s);
void f3(s_T *s);
// Significant performance difference is not expected.
// In both, only an address is passed.
f1(s1);
f1(s3.s2);
// Significant performance difference is expected.
// In the second case, a copy of the entire structure is passed.
// This style of parameter passing is usually frowned upon.
f1(s1);
f2(s3);
// Significant performance difference is not expected.
// In both, only an address is passed.
f1(s1);
f3(&s3);

In some cases, the ABI may have specific rules for returning structures and passing them to functions. For example, given
struct S { int m; };
struct S f(int a, struct S b);
int g(int a, struct S b);
calling f or g may, for example, pass a in a register, and pass b on the stack. Similarly, calling g may use a register for the return value, whereas calling f may require the caller to set up a location where f will store its result.
The performance differences of this should normally be negligible, but one case where it could make a significant difference is when this difference enables or disables tail recursion.
Suppose g is implemented as int g(int a, struct S b) { return f(a, b).m; }. Now, on an implementation where f's result is returned the same way as g's, this may compile to (actual output from clang)
.file "test.c"
.text
.globl g
.align 16, 0x90
.type g,@function
g: # @g
.cfi_startproc
# BB#0:
jmp f # TAILCALL
.Ltmp0:
.size g, .Ltmp0-g
.cfi_endproc
.section ".note.GNU-stack","",@progbits
However, on other implementations, such a tail call is not possible, so if you want to achieve the same results for a deeply recursive function, you really need to give f and g the same return type or you risk a stack overflow. (I'm aware that tail calls are not mandated.)
This doesn't mean int is faster than S, nor does it mean that S is faster than int, though. The memory use would be similar regardless of whether int or S is used, so long as the same one is consistently used.
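For reference, a complete version of the f/g pair could look like the sketch below. The body of f is invented purely so the example links; the answer only declares it. Whether the call in g actually becomes a tail call depends on the ABI and on f not being inlined.

```c
#include <assert.h>

struct S { int m; };

/* Hypothetical body for f, chosen only for illustration. */
struct S f(int a, struct S b)
{
    b.m += a;
    return b;
}

/* g forwards to f and unwraps the single member. */
int g(int a, struct S b)
{
    return f(a, b).m;
}
```

If struct S is returned in the same register as int, g has nothing left to do after f returns, which is what allows the jmp f in the listing above.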

Whether the compiler imposes any penalty for using structs instead of single variables depends strictly on the compiler and its options.
But there is no reason why the compiler should make any difference when your struct contains only one member. There should be no additional code necessary to access the member or to dereference a pointer to such a struct. If you don't have this oversimplified structure with one member, dereferencing might cost one additional CPU instruction, depending on the CPU used.
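A quick way to check the memory side of this on your own toolchain is sketched below. The helper names are mine. The first member of a struct is always at offset 0, while the size being equal to sizeof(int) is what common ABIs do rather than something the C standard promises.

```c
#include <assert.h>
#include <stddef.h>

struct my_int
{
    int value;
};

/* Size of the wrapper; equal to sizeof(int) on common ABIs. */
size_t my_int_size(void) { return sizeof(struct my_int); }

/* Offset of the only member; always 0 for the first member of a struct. */
size_t my_int_value_offset(void) { return offsetof(struct my_int, value); }

/* Access through a pointer; same address as the struct itself. */
int my_int_get(const struct my_int *p) { return p->value; }
```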

A minimal example w/ GCC 10.2.0 -O3 gives exactly the same output i.e. no overhead introduced by struct:
diff -u0 <(
gcc -S -o /dev/stdout -x c -std=gnu17 -O3 -Wall -Wextra - <<EOF
// Out of the box
void OOTB(int *n){
*n+=999;
}
EOF
) <(
gcc -S -o /dev/stdout -x c -std=gnu17 -O3 -Wall -Wextra - <<EOF
// One member struct
typedef struct { int inq_n; } inq;
void OMST(inq *n){
n->inq_n+=999;
}
EOF
)
--- /dev/fd/63 [...]
+++ /dev/fd/62 [...]
@@ -4,3 +4,3 @@
- .globl OOTB
- .type OOTB, @function
-OOTB:
+ .globl OMST
+ .type OMST, @function
+OMST:
@@ -13 +13 @@
- .size OOTB, .-OOTB
+ .size OMST, .-OMST
Not sure about more realistic/complex situations.

GCC: Prohibit use of some registers

This is a strange request but I have a feeling that it could be possible. What I would like is to insert some pragmas or directives into areas of my code (written in C) so that GCC's register allocator will not use them.
I understand that I can do something like this, which might set aside this register for this variable
register int var1 asm ("ebx") = 1984;
register int var2 asm ("r9") = 101;
The problem is that I'm inserting new instructions (for a hardware simulator) directly and GCC and GAS don't recognise these yet. My new instructions can use the existing general purpose registers and I want to make sure that I have some of them (i.e. r12->r15) reserved.
Right now, I'm working in a mockup environment and I want to do my experiments quickly. In the future I will append GAS and add intrinsics into GCC, but right now I'm looking for a quick fix.
Thanks!

When writing GCC inline assembler, you can specify a "clobber list" - a list of registers that may be overwritten by your inline assembler code. GCC will then do whatever is needed to save and restore data in those registers (or avoid their use in the first place) over the course of the inline asm segment. You can also bind input or output registers to C variables.
For example:
inline unsigned long addone(unsigned long v)
{
unsigned long rv;
/* written for 32-bit x86 (e.g. gcc -m32), where unsigned long fits in ebx */
asm("mov $1, %%eax;"
"mov %1, %%ebx;"
"add %%eax, %%ebx"
: /* outputs */ "=b" (rv)
: /* inputs */ "g" (v) /* any general operand (register, memory or immediate) as %1 */
: /* clobbers */ "eax", "cc"
);
return rv;
}
For more information, see the GCC-Inline-Asm-HOWTO.

If you use global explicit register variables, these will be reserved throughout the compilation unit, and will not be used by the compiler for anything else (it may still be used by the system's libraries, so choose something that will be restored by those). local register variables do not guarantee that your value will be in the register at all times, but only when referenced by code or as an asm operand.

If you write an inline asm block for your new instructions, there are commands that inform GCC what registers are used by that block and how they are used. GCC will then avoid using those registers or will at least save and reload their contents.
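As a rough sketch of what that looks like for the r12 to r15 case from the question: the function name is made up, the add instruction merely stands in for the not-yet-supported custom instruction, and the x86-64 guard with a plain-C fallback is my own addition so the snippet builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper around an asm block that uses r12 as scratch. */
uint64_t run_with_reserved_regs(uint64_t in)
{
    uint64_t out = in;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile (
        "mov %[v], %%r12\n\t"   /* r12 = out */
        "add $1, %%r12\n\t"     /* placeholder for the custom instruction */
        "mov %%r12, %[v]"       /* out = r12 */
        : [v] "+r" (out)
        :
        : "r12", "r13", "r14", "r15", "cc"  /* compiler keeps its values out of these */
    );
#else
    out += 1;                   /* fallback for other targets */
#endif
    return out;
}
```

Listing r12 to r15 as clobbers only protects them for the duration of this asm statement; the compiler is still free to use them in the rest of the function.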

Non-hardcoded scratch register in inline assembly
This is not a direct answer to the original question, but since I keep Googling this in that context and since https://stackoverflow.com/a/6683183/895245 was accepted, I'm going to try and provide a possible improvement to that answer.
The improvement is the following: you should avoid hard-coding your scratch registers when possible, to give the register allocator more freedom.
Therefore, as an educational example that is useless in practice (could be done in a single lea (%[in1], %[in2]), %[out];), the following hardcoded scratch register code:
bad.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
uint64_t in1 = 0xFFFFFFFF;
uint64_t in2 = 1;
uint64_t out;
__asm__ (
"mov %[in2], %%rax;" /* scratch = in2 */
"add %[in1], %%rax;" /* scratch += in1 */
"mov %%rax, %[out];" /* out = scratch */
: [out] "=r" (out)
: [in1] "r" (in1),
[in2] "r" (in2)
: "rax"
);
assert(out == 0x100000000);
}
could compile to something more efficient if you instead use this non-hardcoded version:
good.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
uint64_t in1 = 0xFFFFFFFF;
uint64_t in2 = 1;
uint64_t out;
uint64_t scratch;
__asm__ (
"mov %[in2], %[scratch];" /* scratch = in2 */
"add %[in1], %[scratch];" /* scratch += in1 */
"mov %[scratch], %[out];" /* out = scratch */
: [scratch] "=&r" (scratch),
[out] "=r" (out)
: [in1] "r" (in1),
[in2] "r" (in2)
:
);
assert(out == 0x100000000);
}
since the compiler is free to choose any register it wants instead of just rax.
Note that in this example we had to mark the scratch as an early-clobber register with & to prevent it from being put into the same register as an input. I have explained that in more detail at: When to use earlyclobber constraint in extended GCC inline assembly? This example also happens to fail without & in the implementation I tested.
Tested in Ubuntu 18.10 amd64, GCC 8.2.0, compile and run with:
gcc -O3 -std=c99 -ggdb3 -Wall -Werror -pedantic -o good.out good.c
./good.out
Non-hardcoded scratch registers are also mentioned in the GCC manual 6.45.2.6 "Clobbers and Scratch Registers", although their example is too much for mere mortals to take in at once:
Rather than allocating fixed registers via clobbers to provide scratch registers for an asm statement, an alternative is to define a variable and make it an early-clobber output as with a2 and a3 in the example below. This gives the compiler register allocator more freedom. You can also define a variable and make it an output tied to an input as with a0 and a1, tied respectively to ap and lda. Of course, with tied outputs your asm can’t use the input value after modifying the output register since they are one and the same register. What’s more, if you omit the early-clobber on the output, it is possible that GCC might allocate the same register to another of the inputs if GCC could prove they had the same value on entry to the asm. This is why a1 has an early-clobber. Its tied input, lda might conceivably be known to have the value 16 and without an early-clobber share the same register as %11. On the other hand, ap can’t be the same as any of the other inputs, so an early-clobber on a0 is not needed. It is also not desirable in this case. An early-clobber on a0 would cause GCC to allocate a separate register for the "m" (*(const double (*)[]) ap) input. Note that tying an input to an output is the way to set up an initialized temporary register modified by an asm statement. An input not tied to an output is assumed by GCC to be unchanged, for example "b" (16) below sets up %11 to 16, and GCC might use that register in following code if the value 16 happened to be needed. You can even use a normal asm output for a scratch if all inputs that might share the same register are consumed before the scratch is used. The VSX registers clobbered by the asm statement could have used this technique except for GCC’s limit on the number of asm parameters.
static void
dgemv_kernel_4x4 (long n, const double *ap, long lda,
const double *x, double *y, double alpha)
{
double *a0;
double *a1;
double *a2;
double *a3;
__asm__
(
/* lots of asm here */
"#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
"#a0=%3 a1=%4 a2=%5 a3=%6"
:
"+m" (*(double (*)[n]) y),
"+&r" (n), // 1
"+b" (y), // 2
"=b" (a0), // 3
"=&b" (a1), // 4
"=&b" (a2), // 5
"=&b" (a3) // 6
:
"m" (*(const double (*)[n]) x),
"m" (*(const double (*)[]) ap),
"d" (alpha), // 9
"r" (x), // 10
"b" (16), // 11
"3" (ap), // 12
"4" (lda) // 13
:
"cr0",
"vs32","vs33","vs34","vs35","vs36","vs37",
"vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
);
}

Is GCC broken when taking the address of an argument on ARM7TDMI?

My C code snippet takes the address of an argument and stores it in a volatile memory location (preprocessed code):
void foo(unsigned int x) {
*(volatile unsigned int*)(0x4000000 + 0xd4) = (unsigned int)(&x);
}
int main() {
foo(1);
while(1);
}
I used an SVN version of GCC for compiling this code. At the end of function foo I would expect to have the value 1 stored in the stack and, at 0x40000d4, an address pointing to that value. When I compile without optimizations using the flag -O0, I get the expected ARM7TDMI assembly output (commented for your convenience):
.align 2
.global foo
.type foo, %function
foo:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #8
str r0, [sp, #4] @ 3. Store the argument on the stack
mov r3, #67108864
add r3, r3, #212
add r2, sp, #4 @ 4. Address of the stack variable
str r2, [r3, #0] @ 5. Store the address at 0x40000d4
add sp, sp, #8
bx lr
.size foo, .-foo
.align 2
.global main
.type main, %function
main:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, lr}
mov r0, #1 @ 1. Pass the argument in register 0
bl foo @ 2. Call function foo
.L4:
b .L4
.size main, .-main
.ident "GCC: (GNU) 4.4.0 20080820 (experimental)"
It clearly stores the argument on the stack first and then stores the address of that stack slot at 0x40000d4. When I compile with optimizations using -O1, I get something unexpected:
.align 2
.global foo
.type foo, %function
foo:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
sub sp, sp, #8
mov r2, #67108864
add r3, sp, #4 @ 3. Address of *something* on the stack
str r3, [r2, #212] @ 4. Store the address at 0x40000d4
add sp, sp, #8
bx lr
.size foo, .-foo
.align 2
.global main
.type main, %function
main:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, lr}
mov r0, #1 @ 1. Pass the argument in register 0
bl foo @ 2. Call function foo
.L4:
b .L4
.size main, .-main
.ident "GCC: (GNU) 4.4.0 20080820 (experimental)"
This time the argument is never stored on the stack even though something from the stack is still stored at 0x40000d4.
Is this just expected/undefined behaviour? Have I done something wrong or have I in fact found a Compiler Bug™?

Once you return from foo(), x is gone, and any pointers to it are invalid. Subsequently using such a pointer results in what the C standard likes to call "undefined behavior," which means the compiler is absolutely allowed to assume you won't dereference it, or (if you insist on doing it anyway) need not produce code that does anything remotely like what you might expect. If you want the pointer to x to remain valid after foo() returns, you must not allocate x on foo's stack, period -- even if you know that in principle, nothing has any reason to clobber it -- because that just isn't allowed in C, no matter how often it happens to do what you expect.
The simplest solution might be to make x a local variable in main() (or in whatever other function has a sufficiently long-lived scope) and to pass the address in to foo. You could also make x a global variable, or allocate it on the heap using malloc(), or set aside memory for it in some more exotic way. You can even try to figure out where the top of the stack is in some (hopefully) more portable way and explicitly store your data in some part of the stack, if you're sure you won't be needing it for anything else and you're convinced that's what you really need to do. But the method you've been using to do that isn't sufficiently reliable, as you've discovered.
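A minimal sketch of the global-variable option follows. To keep it runnable on a desktop machine, the hardware register at 0x40000d4 is replaced by an ordinary variable called fake_dma_source; that name, dma_color and the two function names are all invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the register at 0x40000d4, so this runs on a host. */
static volatile uintptr_t fake_dma_source;

/* Static storage duration: still valid after fill_color() has returned. */
static volatile unsigned int dma_color;

void fill_color(unsigned int x)
{
    dma_color = x;                           /* copy the value somewhere long-lived */
    fake_dma_source = (uintptr_t)&dma_color; /* publish the address of that copy */
}

/* Plays the role of the hardware reading through the stored address later. */
unsigned int read_back_via_dma_source(void)
{
    return *(volatile unsigned int *)fake_dma_source;
}
```

Because dma_color outlives the call, reading through the stored address after fill_color() has returned is well defined, unlike reading the address of the parameter x.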

I actually don't think the compiler is wrong, although this is an odd case.
From a code analysis point of view, it sees you storing the address of a variable, but that address is never dereferenced and you don't jump outside of the function to external code that could use that address you stored. When you exit the function, the address of the stack is now considered bogus, since it's the address of a variable that no longer exists.
The "volatile" keyword really doesn't do much in C, especially with regards to multiple threads or hardware. It just tells the compiler that it has to do the access. However, since there's no users of the value of x according to the data flow, there's no reason to store the "1" on the stack.
It probably would work if you wrote
void foo(unsigned int x) {
volatile int y = x;
*(volatile unsigned int*)(0x4000000 + 0xd4) = (unsigned int)(&y);
}
although it still may be illegal code, since the address of y is considered invalid as soon as foo returns, but the nature of the DMA system would be to reference that location independently of the program flow.

So you're putting the address of a local stack variable into the DMA controller to be used, and then you're returning from the function where the stack variable is available?
While this might work with your main() example (since you aren't writing on the stack again) it won't work in a 'real' program later - that value will be overwritten before or while DMA is accessing it when another function is called and the stack is used again.
You need to have a structure, or a global variable you can use to store this value while the DMA accesses it - otherwise it's just going to get clobbered!
-Adam

One thing to note is that according to the standard, casts are r-values. GCC used to allow it, but in recent versions has become a bit of a standards stickler.
I don't know if it will make a difference, but you should try this:
void foo(unsigned int x) {
volatile unsigned int* ptr = (unsigned int*)(0x4000000 + 0xd4);
*ptr = (unsigned int)(&x);
}
int main() {
foo(1);
while(1);
}
Also, I doubt you intended it, but you are storing the address of the function local x (which is a copy of the int you passed). You likely want to make foo take an "unsigned int *" and pass the address of what you really want to store.
So I feel a more proper solution would be this:
void foo(unsigned int *x) {
volatile unsigned int* ptr = (unsigned int*)(0x4000000 + 0xd4);
*ptr = (unsigned int)(x);
}
int main() {
unsigned int x = 1;
foo(&x);
while(1);
}
EDIT: finally, if your code breaks with optimizations, it is usually a sign that your code is doing something wrong.

I'm darned if I can find a reference at the moment, but I'm 99% sure that you are always supposed to be able to take the address of an argument, and it's up to the compiler to finesse the details of calling conventions, register usage, etc.
Indeed, I would have thought it to be such a common requirement that it's hard to see there can be general problem in this - I wonder if it's something about the volatile pointers which have upset the optimisation.
Personally, I might try this to see if it compiles better:
void foo(unsigned int x)
{
volatile unsigned int* pArg = &x;
*(volatile unsigned int*)(0x4000000 + 0xd4) = (unsigned int)pArg;
}

Tomi Kyöstilä wrote
I was trying out development for the Game Boy Advance.
I was reading about its DMA system and
I experimented with it by creating
single-color tile bitmaps. The idea
was to have the indexed color be
passed as an argument to a function
which would use DMA to fill a tile
with that color. The source address
for the DMA transfer is stored at
0x40000d4.
That's a perfectly valid thing for you to do, and I can see how the (unexpected) code you got with the -O1 optimization wouldn't work.
I see the (expected) code you got with the -O0 optimization does what you expect -- it puts value of the color you want on the stack, and a pointer to that color in the DMA transfer register.
However, even the (expected) code you got with the -O0 optimization wouldn't work, either.
By the time the DMA hardware gets around to taking that pointer and using it to read the desired color, that value on the stack has (probably) long been overwritten by other subroutines or interrupt handlers or both.
And so both the expected and the unexpected code result in the same thing -- the DMA is (probably) going to fetch the wrong color.
I think you really intended to store the color value in some location where it stays safe until the DMA is finished reading it.
So a global variable, or a function-local static variable such as
// Warning: Three Star Programmer at work
// Warning: untested code.
void foo(unsigned int x) {
static volatile unsigned int color; // "static" so it's not on the stack
color = x; // a static object cannot be initialized from a parameter, so assign instead
volatile unsigned int** dma_register =
(volatile unsigned int**)(0x4000000 + 0xd4);
*dma_register = &color;
}
int main() {
foo(1);
while(1);
}
Does that work for you?
You see I use "volatile" twice, because I want to force two values to be written in that particular order.

sparkes wrote
If you think you have found a bug in
GCC the mailing lists will be glad you
dropped by but generally they find
some hole in your knowledge is to
blame and mock mercilessly :(
I figured I'd try my luck here first before going to the GCC mailing list to show my incompetence :)
Adam Davis wrote
Out of curiosity, what are you trying
to accomplish?
I was trying out development for the Game Boy Advance. I was reading about its DMA system and I experimented with it by creating single-color tile bitmaps. The idea was to have the indexed color be passed as an argument to a function which would use DMA to fill a tile with that color. The source address for the DMA transfer is stored at 0x40000d4.
Will Dean wrote
Personally, I might do try this to see
if it compiled better:
void foo(unsigned int x)
{
volatile unsigned int* pArg = &x;
*(volatile unsigned int*)(0x4000000 + 0xd4) = (unsigned int)pArg;
}
With -O0 that works as well and with -O1 that is optimized to the exact same -O1 assembly I've posted in my question.

Not an answer, but just some more info for you.
We are running 3.4.5 20051201 (Red Hat 3.4.5-2) at my day job.
We have also noticed some of our code (which I can't post here) stops working when
we add the -O1 flag. Our solution was to remove the flag for now :(

In general I would say that it is a valid optimization.
If you want to look deeper into it, you could compile with -da.
This generates files named .c.Number.Passname, where you can have a look at the RTL (the intermediate representation within GCC). There you can see which pass makes which optimization (and maybe disable just the one you don't want).

I think Even T. has the answer. You passed in a variable by value, so inside the function you cannot take the address of the caller's variable; you can only take the address of a copy of it. By the way, that copy typically lives in a register, so it doesn't have an address. Once you leave that function it's all gone; the calling function loses it. If you need the address in the function, you have to pass by reference rather than by value: send the address. It looks to me that the bug is in your code, not gcc.
BTW, using *(volatile blah *)0xabcd or any other method to try to program registers is going to bite you eventually. gcc and most other compilers have this uncanny way of knowing exactly the worst time to strike.
Say the day you change from this
*(volatile unsigned int *)0x12345 = someuintvariable;
to
*(volatile unsigned int *)0x12345 = 0x12;
A good compiler will realize that you are only storing 8 bits and there is no reason to waste a 32 bit store for that, depending on the architecture you specified, or the default architecture for that compiler that day, so it is within its rights to optimize that to an strb instead of an str.
After having been burned by gcc and others with this dozens of times I have resorted to forcing the issue:
.globl PUT32
PUT32:
str r1,[r0]
bx lr
PUT32(0x12345,0x12);
Costs a few extra clock cycles but my code continues to work yesterday, today, and will work tomorrow with any optimization flag. Not having to re-visit old code and sleeping peacefully through the night is worth a few extra clock cycles here and there.
Also if your code breaks when you compile for release instead of compile for debug, that also means it is most likely a bug in your code.

Is this just expected/undefined
behaviour? Have I done something wrong
or have I in fact found a Compiler
Bug™?
No bug just the defined behaviour that optimisation options can produce odd code which might not work :)
EDIT:
If you think you have found a bug in GCC the mailing lists will be glad you dropped by but generally they find some hole in your knowledge is to blame and mock mercilessly :(
In this case I think it's probably the -O options attempting shortcuts that break your code that need working around.