c equivalent of ucomisd

c equivalent of ucomisd - c

While porting a windows binary to linux, I have come across the following set of instructions:
ucomisd xmm5,xmm0
lahf
test ah,0x44
jp 0x42D511
From what I can tell it is comparing the two values in ucomisd, then testing for the presence of either the ZF or PF flags, but not both.
What would the c equivalent be? Every search I do on the topic results in comparing float/double with an epsilon, which this clearly isn't doing.
If it helps, the second operand is always a const value taken from the .rdata section.

The pseudo for ucomisd seems to be
RESULT← UnorderedCompare(DEST[63:0] <> SRC[63:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF←111;
GREATER_THAN: ZF,PF,CF←000;
LESS_THAN: ZF,PF,CF←001;
EQUAL: ZF,PF,CF←100;
ESAC;
OF, AF, SF←0; }
I.e. ZF xor PF would be true if the 2 double-precision operands are equal. If I read correctly this would be
double a, b;
if (a == b) {
...
}
There is no jump if zero and parity flag unset opcode, which is why the lahf + test or some alternative means are required.
Testing this function:
int x(double a, double b) {
return a == b;
}
with GCC produces
xorl %eax, %eax
movl $0, %edx
ucomisd %xmm1, %xmm0
# set al to 1 if no parity flag
setnp %al
# if zero flag not set, zero the return value
cmovne %edx, %eax
ret
i.e. return 1 if no parity flag and equal.

Related

Optimize lookup tables to simple ALU

Question
Say you have a simple function that returns a value based on a look table for example:
See edit about assumptions.
uint32_t
lookup0(uint32_t r) {
static const uint32_t tbl[] = { 0, 1, 2, 3 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r`. */
return tbl[r];
}
uint32_t
lookup1(uint32_t r) {
static const uint32_t tbl[] = { 0, 0, 1, 1 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r / 2`. */
return tbl[r];
}
Is there any super-optimization infrastructure or algorithm that can take go from the lookup table to the optimized ALU implementation.
Motivation
The motivation is I'm building some locks for NUMA machines and want to be able to configure my code generically. Its pretty common that in NUMA locks you will need to do cpu_id -> numa_node. I can obviously setup the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can, I am hoping to generically reach a solution that will be able to cover most layouts.
Looking at how modern compilers do:
Neither clang or gcc are able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
lookup0(unsigned int): # #lookup0(unsigned int)
movl %edi, %eax
movl lookup0(unsigned int)::tbl(,%rax,4), %eax
retq
...
case0(unsigned int): # #case0(unsigned int)
movl %edi, %eax
retq
but can't get lookup1.
lookup1(unsigned int): # #lookup1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
...
case1(unsigned int): # #case1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
Gcc cant get either.
lookup0(unsigned int):
movl %edi, %edi
movl lookup0(unsigned int)::tbl(,%rdi,4), %eax
ret
lookup1(unsigned int):
movl %edi, %edi
movl lookup1(unsigned int)::tbl(,%rdi,4), %eax
ret
case0(unsigned int):
leal -1(%rdi), %eax
cmpl $2, %eax
movl $0, %eax
cmovbe %edi, %eax
ret
case1(unsigned int):
subl $2, %edi
xorl %eax, %eax
cmpl $1, %edi
setbe %al
ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
Edit:
The only true assumption is:
All inputs are have an index in the LUT.
All values are positive (think that makes things easier) and will be true for just about any sys-config thats online.
(Edit4) Would add one more assumption. The LUT is dense. That is it covers a range [<low_bound>, <bound_bound>] but nothing outside of that range.
In my case for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut> but that is specific to the one example I gave and would have some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The example lookup0/lookup1 are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology so ~256 - 1024 is where I would expect the upper bound in size but for a generic LUT it could obviously get much larger.

The best "generic" solution I am aware of is the following:
int compute(int r)
{
static const int T[] = {0,0,1,1};
const int lut_size = sizeof(T) / sizeof(T[0]);
int result = 0;
for(int i=0 ; i<lut_size ; ++i)
result += (r == i) * T[i];
return result;
}
In -O3, GCC and Clang unroll the loop, propagate constants, and generate an intermediate code similar to the following:
int compute(int r)
{
return (r == 0) * 0 + (r == 1) * 0 + (r == 2) * 1 + (r == 3) * 1;
}
GCC/Clang optimizers know that multiplication can be replaced with conditional moves (since developers often use this as a trick to guide compilers generating assembly codes without conditional branches).
The resulting assembly is the following for Clang:
compute:
xor ecx, ecx
cmp edi, 2
sete cl
xor eax, eax
cmp edi, 3
sete al
add eax, ecx
ret
The same applies for GCC. There is no branches nor memory accesses (at least as long as the values are small). Multiplication by small values are also replaced with the fast lea instruction.
A more complete test is available on Godbolt.
Note that this method should work for bigger tables but if the table is too big, then the loop will not be automatically unrolled. You can tell the compiler to use a more aggressive unrolling thanks to compilation flags. That being said, a LUT will likely be faster if it is big since having a huge code to load and execute is slow in this pathological case.

You could pack the array into a long integer and use bitshifts and anding to extract the result.
For example for the table {2,0,3,1} could be handled with:
uint32_t lookup0(uint32_t r) {
static const uint32_t tbl = (2u << 0) | (0u << 8) |
(3u << 16) | (1u << 24);
return (tbl >> (8 * r)) & 0xff;
}
It produces relatively nice assembly:
lookup0: # #lookup0
lea ecx, [8*rdi]
mov eax, 16973826
shr eax, cl
movzx eax, al
ret
Not perfect but branchless and with no indirection.
This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
There are a few tricks to allow handling larger arrays like using longer integers (i.e. uint64_t or __uint128_t extension).
Other approach is splitting bits of value in array like high and low byte, lookup them and combine using bitwise operations.

Trying to reverse engineer a function

I'm trying to understand assembly in x86 more. I have a mystery function here that I know returns an int and takes an int argument.
So it looks like int mystery(int n){}. I can't figure out the function in C however. The assembly is:
mov %edi, %eax
lea 0x0(,%rdi, 8), %edi
sub %eax, %edi
add $0x4, %edi
callq < mystery _util >
repz retq
< mystery _util >
mov %edi, %eax
shr %eax
and $0x1, %edi
and %edi, %eax
retq
I don't understand what the lea does here and what kind of function it could be.

The assembly code appeared to be computer generated, and something that was probably compiled by GCC since there is a repz retq after an unconditional branch (call). There is also an indication that because there isn't a tail call (jmp) instead of a call when going to mystery_util that the code was compiled with -O1 (higher optimization levels would likely inline the function which didn't happen here). The lack of frame pointers and extra load/stores indicated that it isn't compiled with -O0
Multiplying x by 7 is the same as multiplying x by 8 and subtracting x. That is what the following code is doing:
lea 0x0(,%rdi, 8), %edi
sub %eax, %edi
LEA can compute addresses but it can be used for simple arithmetic as well. The syntax for a memory operand is displacement(base, index, scale). Scale can be 1, 2, 4, 8. The computation is displacement + base + index * scale. In your case lea 0x0(,%rdi, 8), %edi is effectively EDI = 0x0 + RDI * 8 or EDI = RDI * 8. The full calculation is n * 7 - 4;
The calculation for mystery_util appears to simply be
n &= (n>>1) & 1;
If I take all these factors together we have a function mystery that passes n * 7 - 4 to a function called mystery_util that returns n &= (n>>1) & 1.
Since mystery_util returns a single bit value (0 or 1) it is reasonable that bool is the return type.
I was curious if I could get a particular version of GCC with optimization level 1 (-O1) to reproduce this assembly code. I discovered that GCC 4.9.x will yield this exact assembly code for this given C program:
#include<stdbool.h>
bool mystery_util(unsigned int n)
{
n &= (n>>1) & 1;
return n;
}
bool mystery(unsigned int n)
{
return mystery_util (7*n+4);
}
The assembly output is:
mystery_util:
movl %edi, %eax
shrl %eax
andl $1, %edi
andl %edi, %eax
ret
mystery:
movl %edi, %eax
leal 0(,%rdi,8), %edi
subl %eax, %edi
addl $4, %edi
call mystery_util
rep ret
You can play with this code on godbolt.
Important Update - Version without bool
I apparently erred in interpreting the question. I assumed the person asking this question determined by themselves that the prototype for mystery was int mystery(int n). I thought I could change that. According to a related question asked on Stackoverflow a day later, it seems int mystery(int n) is given to you as the prototype as part of the assignment. This is important because it means that a modification has to be made.
The change that needs to be made is related to mystery_util. In the code to be reverse engineered are these lines:
mov %edi, %eax
shr %eax
EDI is the first parameter. SHR is logical shift right. Compilers would only generate this if EDI was an unsigned int (or equivalent). int is a signed type an would generate SAR (arithmetic shift right). This means that the parameter for mystery_util has to be unsigned int (and it follows that the return value is likely unsigned int. That means the code would look like this:
unsigned int mystery_util(unsigned int n)
{
n &= (n>>1) & 1;
return n;
}
int mystery(int n)
{
return mystery_util (7*n+4);
}
mystery now has the prototype given by your professor (bool is removed) and we use unsigned int for the parameter and return type of mystery_util. In order to generate this code with GCC 4.9.x I found you need to use -O1 -fno-inline. This code can be found on godbolt. The assembly output is the same as the version using bool.
If you use unsigned int mystery_util(int n) you would discover that it doesn't quite output what we want:
mystery_util:
movl %edi, %eax
sarl %eax ; <------- SAR (arithmetic shift right) is not SHR
andl $1, %edi
andl %edi, %eax
ret

The LEA is just a left-shift by 3, and truncating the result to 32 bit (i.e. zero-extending EDI into RDI implicilty). x86-64 System V passes the first integer arg in RDI, so all of this is consistent with one int arg. LEA uses memory-operand syntax and machine encoding, but it's really just a shift-and-add instruction. Using it as part of a multiply by a constant is a common compiler optimization for x86.
The compiler that generated this function missed an optimization here; the first mov could have been avoided with
lea 0x0(,%rdi, 8), %eax # n << 3 = n*8
sub %edi, %eax # eax = n*7
lea 4(%rax), %edi # rdi = 4 + n*7
But instead, the compiler got stuck on generating n*7 in %edi, probably because it applied a peephole optimization for the constant multiply too late to redo register allocation.
mystery_util returns the bitwise AND of the low 2 bits of its arg, in the low bit, so a 0 or 1 integer value, which could also be a bool.
(shr with no count means a count of 1; remember that x86 has a special opcode for shifts with an implicit count of 1. 8086 only has counts of 1 or cl; immediate counts were added later as an extension and the implicit-form opcode is still shorter.)

The LEA performs an address computation, but instead of dereferencing the address, it stores the computed address into the destination register.
In AT&T syntax, lea C(b,c,d), reg means reg = C + b + c*d where C is a constant, and b,c are registers and d is a scalar from {1,2,4,8}. Hence you can see why LEA is popular for simple math operations: it does quite a bit in a single instruction. (*includes correction from prl's comment below)
There are some strange features of this assembly code: the repz prefix is only strictly defined when applied to certain instructions, and retq is not one of them (though the general behavior of the processor is to ignore it). See Michael Petch's comment below with a link for more info. The use of lea (,rdi,8), edi followed by sub eax, edi to compute arg1 * 7 also seemed strange, but makes sense once prl noted the scalar d had to be a constant power of 2. In any case, here's how I read the snippet:
mov %edi, %eax ; eax = arg1
lea 0x0(,%rdi, 8), %edi ; edi = arg1 * 8
sub %eax, %edi ; edi = (arg1 * 8) - arg1 = arg1 * 7
add $0x4, %edi ; edi = (arg1 * 7) + 4
callq < mystery _util > ; call mystery_util(arg1 * 7 + 4)
repz retq ; repz prefix on return is de facto nop.
< mystery _util >
mov %edi, %eax ; eax = arg1
shr %eax ; eax = arg1 >> 1
and $0x1, %edi ; edi = 1 iff arg1 was odd, else 0
and %edi, %eax ; eax = 1 iff smallest 2 bits of arg1 were both 1.
retq
Note the +4 on the 4th line is entirely spurious. It cannot affect the outcome of mystery_util.
So, overall this ASM snippet computes the boolean (arg1 * 7) % 4 == 3.

Translating O2 optimized for-loop from assembly to C

This is a homework question.
I am attempting to obtain information from the following assembly code (x86 linux machine, compiled with gcc -O2 optimization). I have commented each section to show what I know. A big chunk of my assumptions could be wrong, but I have done enough searching to the point where I know I should ask these questions here.
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "result %lx\n" //Printed string at end of program
.text
main:
.LFB13:
xorl %esi, %esi // value of esi = 0; x
movl $1, %ecx // value of ecx = 1; result
xorl %edx, %edx // value of edx = 0; Loop increment variable (possibly mask?)
.L2:
movq %rcx, %rax // value of rax = 1; ?
addl $1, %edx // value of edx = 1; Increment loop by one;
salq $3, %rcx // value of rcx = 8; Shift left rcx;
andl $3735928559, %eax // value of eax = 1; Value AND 1 = 1;
orq %rax, %rsi // value of rsi = 1; 1 OR 0 = 1;
cmpl $22, %edx // edx != 22
jne .L2 // if true, go back to .L2 (loop again)
movl $.LC0, %edi // Point to string
xorl %eax, %eax // value of eax = 0;
jmp printf // print
.LFE13: ret // return
And I am supposed to turn it into the following C code with the blanks filled in
#include <stdio.h>
int main()
{
long x = 0x________;
long result = ______;
long mask;
for (mask = _________; mask _______; mask = ________) {
result |= ________;
}
printf("result %lx\n",result);
}
I have a couple of questions and sanity checks that I want to make sure I am getting right since none of the similar examples I have found are for optimized code. Upon compiling some trials myself I get something close but the middle part of L2 is always off.
MY UNDERSTANDING
At the beginning, esi is xor'd with itself, resulting in 0 which is represented by x. 1 is then added to ecx, which would be represented by the variable result.
x = 0; result = 1;
Then, I believe a loop increment variable is stored in edx and set to 0. This will be used in the third part of the for loop (update expression). I also think that this variable must be mask, because later on 1 is added to edx, signifying a loop increment (mask = mask++), along with edx being compared in the middle part of the for loop (test expression aka mask != 22).
mask = 0; (in a way)
The loop is then entered, with rax being set to 1. I don't understand where this is used at all since there is no fourth variable I have declared, although it shows up later to be anded and zeroed out .
movq %rcx, %rax;
The loop variable is then incremented by one
addl $1, %edx;
THE NEXT PART MAKES THE LEAST AMOUNT OF SENSE TO ME
The next three operations I feel make up the body expression of the loop, however I have no idea what to do with them. It would result in something similar to result |= x ... but I don't know what else
salq $3, %rcx
andl $3735928559, %eax
orq %rax, %rsi
The rest I feel I have a good grasp on. A comparison is made ( if mask != 22, loop again), and the results are printed.
PROBLEMS I AM HAVING
I don't understand a couple of things.
1) I don't understand how to figure out my variables. There seem to be 3 hardcoded ones along with one increment or temporary storage variable that is found in the assembly (rax, rcx, rdx, rsi). I think rsi would be the x , and rcx would be result, yet I am unsure of if mask would be rdx or rax, and either way, what would the last variable be?
2) What do the 3 expressions of which I am unsure of do? I feel that I have them mixed up with the incrementation somehow, but without knowing the variables I don't know how to go about solving this.
Any and all help will be great, thank you!

The answer is :
#include <stdio.h>
int main()
{
long x = 0xDEADBEEF;
long result = 0;
long mask;
for (mask = 1; mask != 0; mask = mask << 3) {
result |= mask & x;
}
printf("result %lx\n",result);
}
In the assembly :
rsi is result. We deduce that because it is the only value that get ORed, and it is the second argument of the printf (In x64 linux, arguments are stored in rdi, rsi, rdx, and some others, in order).
x is a constant that is set to 0xDEADBEEF. This is not deductible for sure, but it makes sense because it seems to be set as a constant in the C code, and doesn't seem to be set after that.
Now for the rest, it is obfuscated by an anti-optimization by GCC. You see, GCC detected that the loop would be executed exactly 21 times, and thought is was clever to mangle the condition and replace it by a useless counter. Knowing that, we see that edx is the useless counter, and rcx is mask. We can then deduce the real condition and the real "increment" operation. We can see the <<= 3 in the assembly, and notice that if you shift left a 64-bit int 22 times, it becomes 0 ( shift 3, 22 times means shift 66 bits, so it is all shifted out).
This anti-optimization is sadly really common for GCC. The assembly can be replaced with :
.LFB13:
xorl %esi, %esi
movl $1, %ecx
.L2:
movq %rcx, %rax
andl $3735928559, %eax
orq %rax, %rsi
salq $3, %rcx // implicit test for 0
jne .L2
movl $.LC0, %edi
xorl %eax, %eax
jmp printf
It does exactly the same thing, but we removed the useless counter and saved 3 assembly instructions. It also matches the C code better.

Let's work backwards a bit. We know that result must be the second argument to printf(). In the x86_64 calling convention, that's %rsi. The loop is everything between the .L2 label and the jne .L2 instruction. We see in the template that there's a result |= line at the end of the loop, and indeed, there's an orl instruction there with %rsi as its target, so that checks out. We can now see what it's initialized to at the top of .main.
ElderBug is correct that the compiler spuriously optimized by adding a counter. But we can still figure out: which instruction runs immediately after the |= when the loop repeats? That must be the third part of the loop. What runs immediately before the body of the loop? That must be the loop initialization. Unfortunately, you'll have to figure out what would have happened on the 22nd iteration of the original loop to reverse-engineer the loop condition. (But sal is a left-shift, and that line is a vestige of the original loop condition, which would have been followed by a conditional branch before the %rdx test was inserted.)
Note that the code keeps a copy of the value of mask around in %rcx before modifying it in %rax, and x is folded into a constant (take a close look at the andl line).
Also note that you can feed the .S file to gas to get a .o and see what it does.

interpreting this assembly code

I am trying to interpret the following IA32 assembler code and write a function in C that will have an equivalent effect.
Let's say that parameters a, b and c are stored at memory locations with offsets 8, 12 and 16 relative to the address in register %ebp, and that an appropriate function prototype in C would be equivFunction(int a, int b, int c);
movl 12(%ebp), %edx // store b into %edx
subl 16(%ebp), %edx // %edx = b - c
movl %edx, %eax // store b - c into %eax
sall $31, %eax // multiply (b-c) * 2^31
sarl $31, %eax // divide ((b-c)*2^31)) / 2^31
imull 8(%ebp), %edx // multiply a * (b - c) into %edx
xorl %edx, %eax // exclusive or? %edx or %eax ? what is going on here?
First, did I interpret the assembly correctly? If so, how would I go about translating this into C?

The sall/sarl combo has the effect of setting all bits of eax to the value of the zeroth bit. First, sal moves the 0th bit to the 31st position, making it a sign bit. Then sar moves it back, filling the rest of the register with its copy. Don't think of it as division/multiplication - think of it as bitwise shift, which "s" actually stands for.
So eax is 0xffffffff (-1) if b-c is odd, 0 if even. So the imull command places into edx either a negative of a, or zero. The final xor, then, either inverts the all bits of a (that's what xor with one does) or leaves the zero value be.
This whole snippet has an air of artificiality. Is this homework?

The shifts manipulate the sign bit directly, rather than multiplying/dividing, so the code is roughly
int eqivFunction(int a, int b, int c) {
int t1 = b - c;
unsigned t2 = t1 < 0 ? ~0U : 0;
return (a * t1) ^ t2;
}
Alternately:
int eqivFunction(int a, int b, int c) {
int t1 = b - c;
int t2 = a * t1;
if (t1 < 0) t2 = -t2 - 1;
return t2;
}
Of course, the C code has undefined behavior on integer overflow, while the assembly code is well-defined, so the C code might not do the same thing in all cases (particularly if you compile it on a different architecture)

GCC optimization missed opportunity

I'm compiling this C code:
int mode; // use aa if true, else bb
int aa[2];
int bb[2];
inline int auto0() { return mode ? aa[0] : bb[0]; }
inline int auto1() { return mode ? aa[1] : bb[1]; }
int slow() { return auto1() - auto0(); }
int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
Both slow() and fast() functions are meant to do the same thing, though fast() does it with one branch statement instead of two. I wanted to check if GCC would collapse the two branches into one. I've tried this with GCC 4.4 and 4.7, with various levels of optimization such as -O2, -O3, -Os, and -Ofast. It always gives the same strange results:
slow():
movl mode(%rip), %ecx
testl %ecx, %ecx
je .L10
movl aa+4(%rip), %eax
movl aa(%rip), %edx
subl %edx, %eax
ret
.L10:
movl bb+4(%rip), %eax
movl bb(%rip), %edx
subl %edx, %eax
ret
fast():
movl mode(%rip), %esi
testl %esi, %esi
jne .L18
movl bb+4(%rip), %eax
subl bb(%rip), %eax
ret
.L18:
movl aa+4(%rip), %eax
subl aa(%rip), %eax
ret
Indeed, only one branch is generated in each function. However, slow() seems to be inferior in a surprising way: it uses one extra load in each branch, for aa[0] and bb[0]. The fast() code uses them straight from memory in the subls without loading them into a register first. So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow(). I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
Edit: here are the results with clang 4.2 on Mac OS:
slow():
movq _aa#GOTPCREL(%rip), %rax ; rax = aa (both ints at once)
movq _bb#GOTPCREL(%rip), %rcx ; rcx = bb
movq _mode#GOTPCREL(%rip), %rdx ; rdx = mode
cmpl $0, (%rdx) ; mode == 0 ?
leaq 4(%rcx), %rdx ; rdx = bb[1]
cmovneq %rax, %rcx ; if (mode != 0) rcx = aa
leaq 4(%rax), %rax ; rax = aa[1]
cmoveq %rdx, %rax ; if (mode == 0) rax = bb
movl (%rax), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
fast():
movq _mode#GOTPCREL(%rip), %rax ; rax = mode
cmpl $0, (%rax) ; mode == 0 ?
je LBB1_2 ; if (mode != 0) {
movq _aa#GOTPCREL(%rip), %rcx ; rcx = aa
jmp LBB1_3 ; } else {
LBB1_2: ; // (mode == 0)
movq _bb#GOTPCREL(%rip), %rcx ; rcx = bb
LBB1_3: ; }
movl 4(%rcx), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
Interesting: clang generates branchless conditionals for slow() but one branch for fast()! On the other hand, slow() does three loads (two of which are speculative, one will be unnecessary) vs. two for fast(). The fast() implementation is more "obvious," and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux. Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS. That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.

The reason is that in the initial intermediate code, emitted for slow(), the memory load and the subtraction are in different basic blocks:
slow ()
{
int D.1405;
int mode.3;
int D.1402;
int D.1379;
# BLOCK 2 freq:10000
mode.3_5 = mode;
if (mode.3_5 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:5000
D.1402_6 = aa[1];
D.1405_10 = aa[0];
goto <bb 5>;
# BLOCK 4 freq:5000
D.1402_7 = bb[1];
D.1405_11 = bb[0];
# BLOCK 5 freq:10000
D.1379_3 = D.1402_17 - D.1405_12;
return D.1379_3;
}
whereas in fast() they are in the same basic block:
fast ()
{
int D.1377;
int D.1376;
int D.1374;
int D.1373;
int mode.1;
int D.1368;
# BLOCK 2 freq:10000
mode.1_2 = mode;
if (mode.1_2 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:3900
D.1373_3 = aa[1];
D.1374_4 = aa[0];
D.1368_5 = D.1373_3 - D.1374_4;
goto <bb 5>;
# BLOCK 4 freq:6100
D.1376_6 = bb[1];
D.1377_7 = bb[0];
D.1368_8 = D.1376_6 - D.1377_7;
# BLOCK 5 freq:10000
return D.1368_1;
}
GCC relies on instruction combining pass to handle cases like this (i.e. apparently not on the peephole optimization pass) and combining works on the scope of a basic block. That's why the subtraction and load are combined in a single insn in fast() and they aren't even considered for combining in slow().
Later, in the basic block reordering pass, the subtraction in slow() is duplicated and moved into the basic blocks, which contain the loads. Now there's opportunity for the combiner to, well, combine the load and the subtraction, but unfortunately, the combiner pass is not run again (and perhaps it cannot be run that late in the compilation process with hard registers already allocated and stuff).

I don't have an answer as to why GCC is unable to optimize the code the way you want it to, but I have a way to re-organize your code to achieve similar performance. Instead of organizing your code the way you have done so in slow() or fast(), I would recommend that you define an inline function that returns either aa or bb based on mode without needing a branch:
inline int * xx () { static int *xx[] = { bb, aa }; return xx[!!mode]; }
inline int kwiky(int *xx) { return xx[1] - xx[0]; }
int kwik() { return kwiky(xx()); }
When compiled by GCC 4.7 with -O3:
movl mode, %edx
xorl %eax, %eax
testl %edx, %edx
setne %al
movl xx.1369(,%eax,4), %edx
movl 4(%edx), %eax
subl (%edx), %eax
ret
With the definition of xx(), you can redefine auto0() and auto1() like so:
inline int auto0() { return xx()[0]; }
inline int auto1() { return xx()[1]; }
And, from this, you should see that slow() now compiles into code similar or identical to kwik().

Have you tried to modify internals compilers parameters (--param name=value in man page). Those are not changed with any optimizations level (with three minor excepts).
Some of them control code reduction/deduplication.
For some optimizations in this section you can read things like « larger values can exponentially increase compilation time » .