Micro-optimizing C code for ARM

Micro-optimizing C code for ARM - c

Apparently it's true that on ARM cpus, division is 10-100x slower than bit shifts. On this site it is stated that this can be solved in a number of ways. One of them being look-up tables for small problems, which is fine and standard. But listed was also replacing division with multiplication by a fixed-point reciprocal followed by a bit shift (so that x/3 becomes (x*6) << 1 etc) Another was replacing (x % y) > z with x > (z * y).
I'm far from an expert, but this sounds really odd to me. I mean, if you're using a modern compiler, wouldn't this be exactly the kind of thing that is optimized for you?

unsigned int fun1 ( unsigned int a, unsigned int b )
{
return(a/b);
}
unsigned int fun2 ( unsigned int a )
{
return(a/2);
}
unsigned int fun3 ( unsigned int a )
{
return(a/3);
}
unsigned int fun10 ( unsigned int a )
{
return(a/10);
}
unsigned int fun13 ( void )
{
return(10/13);
}
and just try it.
00000000 <fun1>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <__aeabi_uidiv>
8: e8bd4010 pop {r4, lr}
c: e12fff1e bx lr
00000010 <fun2>:
10: e1a000a0 lsr r0, r0, #1
14: e12fff1e bx lr
00000018 <fun3>:
18: e59f3008 ldr r3, [pc, #8] ; 28 <fun3+0x10>
1c: e0802093 umull r2, r0, r3, r0
20: e1a000a0 lsr r0, r0, #1
24: e12fff1e bx lr
28: aaaaaaab bge feaaaadc <fun13+0xfeaaaa9c>
0000002c <fun10>:
2c: e59f3008 ldr r3, [pc, #8] ; 3c <fun10+0x10>
30: e0802093 umull r2, r0, r3, r0
34: e1a001a0 lsr r0, r0, #3
38: e12fff1e bx lr
3c: cccccccd stclgt 12, cr12, [r12], {205} ; 0xcd
00000040 <fun13>:
40: e3a00000 mov r0, #0
44: e12fff1e bx lr
As one would expect, if the compiler couldn't deal with it compile-time then it calls the appropriate library function, which is the root of the performance issue. If you don't have a native divide instruction then you end up with many instructions executed, plus all of their fetches. 10 to 100 times slower sounds about right.
Interesting that they do use the 1/3 and 1/10th trick here, and if the result can be computed at compile time, then just return the fixed result.
Compiler authors can read the same Hackers Delight and Stack Overflow pages we can and know the same tricks and, if willing and interested, can implement those optimizations. Don't assume they always will; just because I have some version of some compiler that finds these doesn't mean all compilers can/will.
As far as whether you should let the compiler/toolchain do it or not for you: that's up to you; even if you have the divide instruction, if you target multiple platforms you may choose to shift right instead of divide by 2; you may choose to do other of these tricks. If you own the divide then you at least know what it is doing; if you give it over to the compiler then you have to regularly disassemble to understand what it is doing (if you care). If this is in a timing critical section then you may wish to do both, see what the compiler does, then steal that answer or create your own deterministic solution (leaving it up to the compiler is not necessarily deterministic and I think that is the point).
EDIT
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-objdump -D so.o
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I have a gcc 4.8.3 here that also produced those optimizations...as well as a 5.4.0, so they have been doing it for a while.
The arm UMULL instruction is a 64 bit = 32 bit * 32 bit operation, so it can't overflow the multiply. Certainly for 1/3rd and 1/10th and not sure how large a value of N for 1/N you can go in 64 bits and have any 32 bit operand work. Performing a simple experiment shows that at least for these two cases all possible 32 bit patterns work that is for unsigned.
It appears to use the trick for signed as well:
int negfun ( int a )
{
return(a/3);
}
00000000 <negfun>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <negfun+0x10>
4: e0c32390 smull r2, r3, r0, r3
8: e0430fc0 sub r0, r3, r0, asr #31
c: e12fff1e bx lr
10: 55555556 ldrbpl r5, [r5, #-1366] ; 0xfffffaaa

Divide by constant is often optimized by compilers to a multiply and shift sequence even on processors with a divide instruction. In some cases the sequence is a bit longer, but still only uses one multiply. Link to prior thread about this.
Why does GCC use multiplication by a strange number in implementing integer division?
Divide by variable on a processor without a divide is usually handled by an optimized function, based on some variation of the methods mentioned in this wiki article:
http://en.wikipedia.org/wiki/Division_algorithm#Fast_division_methods
Using a 32 bit by 32 bit divide as an example, there may be 3 main paths used. For divisor < 256, the divide by constant method can be used (256 entry table). For expected quotients < 256, an unfolded subtract and shift sequence may be used. The main path does a table lookup to get an initial approximation, then a sequence that includes 4 multiplies, some adds, subtracts, and shifts to quadruple the number of correct bits from the table value in the estimated quotient such that estimated quotient = actual quotient or actual quotient - 1. Then the product of estimated quotient * divisor is subtracted from dividend, and if remainder >= divisor, quotient is incremented and divisor subtracted from dividend. For a 64 bit by 64 bit divide, the main sequence would involve 6 multiplies, ... to produce the estimated quotient.

Related

Keil ARMCC int64 comparison for Cortex M3

I noticed that armcc generates this kind of code to compare two int64 values:
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
Which can be roughly translated as:
r0 = lower_word_1 ^ lower_word_2
r1 = higher_word_1 ^ higher_word_2
r0 = r1 | r0
jump if r0 is not zero
and something like this, when comparing int64 (int r0,r1) with integral constant (i.e. int, in r3)
0x08000674 4058 EORS r0,r0,r3
0x08000676 4308 ORRS r0,r0,r1
0x08000678 D116 BNE 0x080006A8
with the same idea, just skipping comparing higher words altogether since it just needs to be zero.
but I'm interested - why is it so complicated?
Both cases can be done very straight-forward by comparing lower and higher words and making BNE after both:
for two int64, assuming the same registers
CMP lower words
BNE
CMP higher words
BNE
and for int64 with integral constant:
CMP lower words
BNE
CBNZ if higher word is non-zero
This will take the same number of instructions, each may (or may not, depending on the registers used) be 2 bytes in length.
arm-none-eabi-gcc does something different but no playing around with EORS either
So why armcc does this? I can't see any real benefit; both version require the same number of commands (each of which my be wide or short, so no real profit there).
The only slight benefit I can see is that less branching which my be somewhat beneficial for a flash prefetch buffer. But since there is no cache or branch prediction, I'm not really buying it.
So my reasoning is that this pattern is simply legacy, from ARM7 Architecture where no CBZ/CBNZ existed and mixing ARM and Thumb instructions was not very easy.
Am I missing something?
P.S. Armcc does this on every optimization level so I presume it is some kind of 'hard-coded' piece
UPD: Sure, there is an execution pipeline that will be flushed with every branch taken, however every solution requires at least one conditional branch that will or will not be taken (depending on integers that are compared), so pipeline will be flushed anyway with equal probability.
So I can't really see a point in minimizing conditional branches.
Moreover, if lower and higher words would be compared explicitly and integers are not equal, branch will be taken sooner.
Avoiding branch instruction completely is possible with IT-block but on Cortex-M3 it can be only up to 4 instructions long so I'm gonna ignore this for generality.

The efficiency of the generated code is not counted in the number of the machine code instructions. You need to know the internals of the target machine as well (not only the clock/instruction) but also how the fetch/decode/execute process works.
Every branch instruction in the Cortex M3 devices flushes the pipeline. Pipeline has to be fed again. If you run from FLASH memory (it is slow) wait states will also significantly slow this process. The compiler tries to avoid branches as much as it is possible.
It can be done your way using other instructions:
int foo(int64_t x, int64_t y)
{
return x == y;
}
cmp r1, r3
itte eq
cmpeq r0, r2
moveq r0, #1
movne r0, #0
bx lr
Trust your compiler. People who write them know their trade :). Before you learn more about the ARM Cortex you cant judge the compiler this simple way as you do now.
The code from your example is very well optimized and simple. Keil does a very good job.

As pointed out the difference is branching vs not branching. If you can avoid branching you want to avoid branching.
While the ARM documentation may be interesting, as with an x86 and a full sized ARM and many other places the system plays as of a role here. High performance cores like ones from ARM are sensitive to the system implementation. These cortex-m cores are used in microcontrollers which are quite cost sensitive, so while they blow away a PIC or AVR or msp430 for mips to mhz and mips per dollar they are still cost sensitive. With newer technology or perhaps higher cost, you are starting to see flashes that are at the speed of the processor for the full range (do not have to add wait states at various places across the range of valid clock speeds), but for a long time you saw the flash at half the speed of the core at the slowest core speeds. And then getting worse as you choose higher core speeds. But sram often matching the core. Either way flash is a major portion of the cost of the part and how much and how fast it is to some extent drives part price.
Depending on the core (anything from ARM) the fetch size and as a result alignment varies and as a result benchmarks can be skewed/manipulated based on alignment of a loop style test and how many fetches are needed (trivial to demonstrate with many cortex-ms). The cortex-ms are generally either a halfword or full word fetch and some are compile time options for the chip vendor (so you might have two chips with the same core but the performance varies). And this can be demonstrated too...just not here...unless pushed, I have done this demo too many times at this site now. But we can manage that here in this test.
I do not have a cortex-m3 handy I would have to dig one out and wire it up if need be, should not need to though have a cortex-m4 handy which is also an armv7-m. A NUCLEO-F411RE
Test fixture
.thumb_func
.globl HOP
HOP:
bx r2
.balign 0x20
.thumb_func
.globl TEST0
TEST0:
push {r4,r5}
mov r4,#0
mov r5,#0
ldr r2,[r0]
t0:
cmp r4,r5
beq skip
skip:
subs r1,r1,#1
bne t0
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
The systick timer generally works just fine for these kinds of tests, no need to mess with the debuggers timer it often just shows the same thing with more work. More than enough here.
Called like this with the result printed out in hex
hexstring(TEST0(STK_CVR,0x10000));
hexstring(TEST0(STK_CVR,0x10000));
copy the flash code to ram and execute there
hexstring(HOP(STK_CVR,0x10000,0x20000001));
hexstring(HOP(STK_CVR,0x10000,0x20000001));
Now the stm32's have this cache thing in front of the flash which affects loop based benchmarks like these as well as other benchmarks against these parts, sometimes you cannot get past that and you end up with a bogus benchmark. But not in this case.
To demonstrate fetch effects you want a system delay in fetching, if the fetches are too fast you might not see the fetch effects.
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d1ff bne.n 8000030 <skip>
08000030 <skip>:
00050001 <-- flash time
00050001 <-- flash time
00060004 <-- sram time
00060004 <-- sram time
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d0ff beq.n 8000030 <skip>
08000030 <skip>:
00060001
00060001
00080000
00080000
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: bf00 nop
08000030 <skip>:
00050001
00050001
00060000
00060000
So we can see that if the branch is not taken it is the same as a nop. As far as this loop based test goes. So perhaps there is a branch predictor (often a small cache that remembers the last N number of branches and their destinations and can start prefetch a clock or two early). I did not dig into it yet, did not really need to as we can already see that there is a performance cost due to a branch that has to be taken (making your suggested code not equal despite the same number of instructions, this is the same number of instructions but not equal performance).
So the quickest way to remove the loop and avoid the stm32 cache thing is to do something like this in ram
push {r4,r5}
mov r4,#0
mov r5,#0
cmp r4,r5
ldr r2,[r0]
instruction under test repeated many times
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
with the instruction under test being a bne to the next, a beq to the next or a nop
// 800002e: d1ff bne.n 8000030 <skip>
00002001
// 800002e: d0ff beq.n 8000030 <skip>
00004000
// 800002e: bf00 nop
00001001
I did not have room for 0x10000 instructions so I used 0x1000, and we can see that there is a hit for both branch types with the one that does branch being more costly.
Note that the loop based benchmark did not show this difference, have to be careful doing benchmarks or judging results. Even the ones I have shown here.
I could spend more time tweaking core settings or system settings, but based on experience I think this has already demonstrated the desire not to have a cmp, bne, cbnz replace eor, orr, bne. Now to be fair, your other one where it is a eor.w (thumb2 extensions) that burns more clocks than thumb2 instructions so there is another thing to consider (I measured it as well).
Remember for these high performance cores you need to be very sensitive to fetching and fetch alignment, very easy to make a bad benchmark. Not that an x86 is not high performance, but to make the inefficient core run smoother there is a ton of stuff around it to try to keep the core fed, similar to running a semi-truck vs a sports car, the truck can be efficient once up to speed on the highway but city driving, not so much even keeping to the speed limit a Yugo will get across town faster than the semi truck (if it does not break down). Fetch effects, unaligned transfers, etc are difficult to see in an x86, but an ARM somewhat easy, so to get the best performance you want to avoid the easy cycle eaters.
Edit
Note that I jumped to conclusions too early about what GCC produces. Had to work more on trying to craft an equivalent comparison. I started with
unsigned long long fun2 ( unsigned long long a)
{
if(a==0) return(1);
return(0);
}
unsigned long long fun3 ( unsigned long long a)
{
if(a!=0) return(1);
return(0);
}
00000028 <fun2>:
28: 460b mov r3, r1
2a: 2100 movs r1, #0
2c: 4303 orrs r3, r0
2e: bf0c ite eq
30: 2001 moveq r0, #1
32: 4608 movne r0, r1
34: 4770 bx lr
36: bf00 nop
00000038 <fun3>:
38: 460b mov r3, r1
3a: 2100 movs r1, #0
3c: 4303 orrs r3, r0
3e: bf14 ite ne
40: 2001 movne r0, #1
42: 4608 moveq r0, r1
44: 4770 bx lr
46: bf00 nop
Which used an it instruction which is a natural solution here since the if-then-else cases can be a single instruction. Interesting that they chose to use r1 instead of the immediate #0 I wonder if that is a generic optimization, due to complexity with immediates on a fixed length instruction set or perhaps immediates take less space on some architectures. Who knows.
800002e: bf0c ite eq
8000030: bf00 nopeq
8000032: bf00 nopne
00003002
00003002
800002e: bf14 ite ne
8000030: bf00 nopne
8000032: bf00 nopeq
00003002
00003002
Using sram 0x1000 sets of three instructions linearly, so 0x3002 means 1 clock per instruction on average.
Putting a mov in the it block doesn't change performance
ite eq
moveq r0, #1
movne r0, r1
It is still one clock per.
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a)
{
for(;a!=0;a--)
{
more_fun(5);
}
return(0);
}
48: b538 push {r3, r4, r5, lr}
4a: ea50 0301 orrs.w r3, r0, r1
4e: d00a beq.n 66 <fun4+0x1e>
50: 4604 mov r4, r0
52: 460d mov r5, r1
54: 2005 movs r0, #5
56: f7ff fffe bl 0 <more_fun>
5a: 3c01 subs r4, #1
5c: f165 0500 sbc.w r5, r5, #0
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
66: 2000 movs r0, #0
68: 2100 movs r1, #0
6a: bd38 pop {r3, r4, r5, pc}
This is basically the compare with zero
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
Against another
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a, unsigned long long b)
{
for(;a!=b;a--)
{
more_fun(5);
}
return(0);
}
00000048 <fun4>:
48: 4299 cmp r1, r3
4a: bf08 it eq
4c: 4290 cmpeq r0, r2
4e: d011 beq.n 74 <fun4+0x2c>
50: b5f8 push {r3, r4, r5, r6, r7, lr}
52: 4604 mov r4, r0
54: 460d mov r5, r1
56: 4617 mov r7, r2
58: 461e mov r6, r3
5a: 2005 movs r0, #5
5c: f7ff fffe bl 0 <more_fun>
60: 3c01 subs r4, #1
62: f165 0500 sbc.w r5, r5, #0
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
6e: 2000 movs r0, #0
70: 2100 movs r1, #0
72: bdf8 pop {r3, r4, r5, r6, r7, pc}
74: 2000 movs r0, #0
76: 2100 movs r1, #0
78: 4770 bx lr
7a: bf00 nop
And they choose to use an it block here.
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
It is on par with this for number of instructions.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
But those thumb2 instructions are going to execute longer. So overall I think GCC appears to have made a better sequence, but of course you want to check apples to apples start with the same C code and see what each produced. The gcc one reads easier than the eor/orr stuff, can think less about what it is doing.
8000040: 406c eors r4, r5
00001002
8000042: ea94 0305 eors.w r3, r4, r5
00002001
0x1000 instructions one is two halfwords (thumb2) one is one halfword (thumb). Takes two clocks not really surprised.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
I see six clocks there before adding any other penalties, not four (on this cortex-m4).
Note I made the eors.w aligned and unaligned and it did not change the performance. Still two clocks.

What is the use of SBC instruction in arm?

I understand how the SBC instruction in ARM works.
But, I don't seem to understand how it will be useful, as the intended answer is always less by 1.
Example:
MOV r1, #0x88
MOV r2, #0x44
SUB r3, r1, r2
SBC r4, r1, r2
After this operation, r3 has 0x44 (correct) and r4 has 0x43 (incorrect).
I don't see in which case SBC is a more relevant operation than SUB.
Thanks.

This operation is a substration that adds the carry (PSTATE.C) to the result:
r4 = r1 - r2 - (1-CPSR.C)
CPSR.NZCV has been set by a previous operation that sets flags (For example CMP orADDS).
This type of operation can be useful for large integer additions.
For example, in Aarch32 if you want to calculate a 64-bit addition, you add the 32-bit bottom bits (ADDS) then use ADDC to do the top 32-bit with carry propagation.

Why does gcc compile f(1199) and f(1200) differently?

What causes GCC 7.2.1 on ARM to use a load from memory (lr) for certain constants, and an immediate (mov) in some other cases? Concretely, I'm seeing the following:
GCC 7.2.1 for ARM compiles this:
extern void abc(int);
int test() { abc(1199); return 0; }
…into that:
test():
push {r4, lr}
ldr r0, .L4 // ??!
bl abc(int)
mov r0, #0
pop {r4, lr}
bx lr
.L4:
.word 1199
and this:
extern void abc(int);
int test() { abc(1200); return 0; }
…into that:
test():
push {r4, lr}
mov r0, #1200 // OK
bl abc(int)
mov r0, #0
pop {r4, lr}
bx lr
At first I expected 1200 to be some sort of unique cutoff, but there are other cut-offs like this at 1024 (1024 yields a mov r0, #1024, whereas 1025 uses ldr) and at other values.
Why would GCC use a load from memory to fetch a constant, rather than using an immediate?

This has to do with the way that constant operands are encoded in the ARM instruction set. They are encoded as an (unsigned) 8-bit constant combined with a 4 bit rotate field -- the 8 bit value will be rotated by 2 times the value in that 4 bit field. So any value that fits in that form can be used as a constant argument.
The constant 1200 is 10010110000 in binary, so it can be encoded as the 8-bit constant 01001011 combined with a rotate of 4.
The constant 1199 is 10010101111 in binary, so there's no way to fit it in an ARM constant operand.

ARM Parameter Passing

I am trying to write an ARM program that takes three numbers and calculates the discriminant. It has two source files, driver.s & prog3.s. I understand how to find the discriminate, but how do I pass the values A, B, & C into the discrim function from the main function? I have included the code I typed thus far....
MAIN() driver.s
avalue .reg r0
bvalue .req r1
cvalue .req r2
final .req r3
loopcount .req r4
readA:
.ascii “%d”
readB:
.ascii “%d”
readC:
.ascii “%d”
addressReadA: .word readA
addressReadB: .word readB
addressReadC: .word readC
main:
ldr avalue, addressReadA # load in avalue
ldr bvalue, addressReadB # load in bvalue
ldr cvalue, addressReadC # load in cvalue
DISCRIM() prog3.s
avalue .reg r0
bvalue .req r1
cvalue .req r2
final .req r3
discrim:
mul bvalue, bvalue, bvalue # square bvalue
mul avalue, avalue, #4 # multiply avalue by 4
mul cvalue, avalue, cvalue # multiply avalue by cvalue
add final, bvalue, cvalue # calculated discriminant

Going with the calling convention that C compilers use is not a bad idea, esp since if you go from pure assembly programs to C and asm mixed, you already have that experience. And/or you may see the simplicity and wisdom in the calling conventions used.
How do you know what the calling convention for a compiler is? 1) read the manual/documentation and google. 2) just try it. Prototype a function that is similar in the number of operands the type of operands and return value and feed it real-ish numbers and see what it produces.
Compiling to asm sometimes works but with pseudo instructions and other things done by the assembler I prefer to dissemble than to compile to asm YMMV.
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c );
unsigned int test ( void )
{
return(fun(1,2,3));
}
which with gnu currently results in
00000000 <test>:
0: e92d4010 push {r4, lr}
4: e3a02003 mov r2, #3
8: e3a01002 mov r1, #2
c: e3a00001 mov r0, #1
10: ebfffffe bl 0 <fun>
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
Each combination of compiler and target may have a different calling convention, there is no reason to assume that different compilers or versions of the same compiler use the same convention. ARM, MIPS, and no doubt others try to help/encourage/suggest a calling convention to use and some compilers simply follow that, why not.
There are lots of exceptions to the rule in the convention, but for ARM for the first up to four registers worth of parameters, in this case for up to four signed or unsigned integers or up to four less than or equal to 32 bit quantities (float can create exceptions) the first four general purposes regisers are used r0 for the first parameter r1 for the second and so on. And currently the standard keeps the stack aligned on 64 bit boundaries.
So we see that the first parameter is indeed placed in r0 the second in r1 and third in r2, obviously you dont have to arrange those three instructions in that order, doesnt matter.
because this function is calling another function it has to preserve its return value in lr so that goes on the stack, because the standard says to keep the stack aligned on 64 bit boundaries they are pushing another register on the stack r4 is arbitrary it could be any register, this is the one the tool chose.
because the standard says to return in r0, code that implements one of these functions.
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c )
{
return(a+b^c);
}
00000000 <fun>:
0: e0800001 add r0, r0, r1
4: e0200002 eor r0, r0, r2
8: e12fff1e bx lr
it is very interesting now that I see this that the compiler did not do a tail optimization on the call, it could have not saved lr and did a branch to fun, since the return value in r0 is what test() was also returning in the same register. really kind of baffled that that didnt happen.
but you can see that indeed the return value is left in r0, and per the convention we can trash r0-r3 we dont have to preserve them, and these functions are not.
if you change test to this
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c );
unsigned int test ( void )
{
return(fun(1,2,3)+7);
}
then it cant tail optimize and also shows the return register so you dont have to create a fun() function to see it.
00000000 <test>:
0: e92d4010 push {r4, lr}
4: e3a02003 mov r2, #3
8: e3a01002 mov r1, #2
c: e3a00001 mov r0, #1
10: ebfffffe bl 0 <fun>
14: e8bd4010 pop {r4, lr}
18: e2800007 add r0, r0, #7
1c: e12fff1e bx lr
you can do this kind of thing with other targets or other compilers, and there is no reason to assume that one target has the same convention as another.
Disassembly of section .text:
00000000 <fun>:
0: 0f 5e add r14, r15
2: 0f ed xor r13, r15
4: 30 41 ret
0000000000000000 <fun>:
0: 8d 04 37 lea (%rdi,%rsi,1),%eax
3: 31 d0 xor %edx,%eax
5: c3 retq
and this one is stack based instead of register based
Disassembly of section .text:
00000000 <_fun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d41 0004 mov 4(r5), r1
8: 6d41 0006 add 6(r5), r1
c: 1d40 0008 mov 10(r5), r0
10: 7840 xor r1, r0
12: 1585 mov (sp)+, r5
14: 0087 rts pc
But if this is just a pure assembly project and you dont have to interface with compiled output, do whatever you want, part of designing the project is not just each individual function but how they interact, no different than C or Python or some other language you have to still define the interface for yourself between functions. Assembly doesnt make that special or different, just another language.

How does this disassembly correspond to the given C code?

Environment: GCC 4.7.3 (arm-none-eabi-gcc) for ARM Cortex m4f. Bare-metal (actually MQX RTOS, but here that's irrelevant). The CPU is in Thumb state.
Here's a disassembler listing of some code I'm looking at:
//.label flash_command
// ...
while(!(FTFE_FSTAT & FTFE_FSTAT_CCIF_MASK)) {}
// Compiles to:
12: bf00 nop
14: f04f 0300 mov.w r3, #0
18: f2c4 0302 movt r3, #16386 ; 0x4002
1c: 781b ldrb r3, [r3, #0]
1e: b2db uxtb r3, r3
20: b2db uxtb r3, r3
22: b25b sxtb r3, r3
24: 2b00 cmp r3, #0
26: daf5 bge.n 14 <flash_command+0x14>
The constants (after expending macros, etc.) are:
address of FTFE_FSTAT is 0x40020000u
FTFE_FSTAT_CCIF_MASK is 0x80u
This is compiled with NO optimization (-O0), so GCC shouldn't be doing anything fancy... and yet, I don't get this code. Post-answer edit: Never assume this. My problem was getting a false sense of security from turning off optimization.
I've read that "uxtb r3,r3" is a common way of truncating a 32-bit value. Why would you want to truncate it twice and then sign-extend? And how in the world is this equivalent to the bit-masking operation in the C-code?
What am I missing here?
Edit: Types of the thing involved:
So the actual macro expansion of FTFE_FSTAT comes down to
((((FTFE_MemMapPtr)0x40020000u))->FSTAT)
where the struct is defined as
/** FTFE - Peripheral register structure */
typedef struct FTFE_MemMap {
uint8_t FSTAT; /**< Flash Status Register, offset: 0x0 */
uint8_t FCNFG; /**< Flash Configuration Register, offset: 0x1 */
//... a bunch of other uint_8
} volatile *FTFE_MemMapPtr;

The two uxtb instructions are the compiler being stupid, they should be optimized out if you turn on optimization. The sxtb is the compiler being brilliant, using a trick that you wouldn't expect in unoptimized code.
The first uxtb is due to the fact that you loaded a byte from memory. The compiler is zeroing the other 24 bits of register r3, so that the byte value fills the entire register.
The second uxtb is due to the fact that you're ANDing with an 8-bit value. The compiler realizes that the upper 24-bits of the result will always be zero, so it's using uxtb to clear the upper 24-bits.
Neither of the uxtb instructions does anything useful, because the sxtb instruction overwrites the upper 24 bits of r3 anyways. The optimizer should realize that and remove them when you compile with optimizations enabled.
The sxtb instruction takes the one bit you care about 0x80 and moves it into the sign bit of register r3. That way, if bit 0x80 is set, then r3 becomes a negative number. So now the compiler can compare with 0 to determine whether the bit was set. If the bit was not set then the bge instruction branches back to the top of the while loop.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight