simple ADD/ADC ARM assemlby fails

simple ADD/ADC ARM assemlby fails - c

I have the following C and ASM version of the (supposedly) same code. What it does is load 2 128bit ints represented by 2 64bit ints each to registers (first 4*lower 32bit, then 4*higher 32bit) and ADD/ADC to them. It is simple enough code and the ARM/ST manuals actually give the same example with 96bit (3 ADD/ADCs).
For simple calls both versions work (repeatedly adding (1 << x++) or 1..x). But for the longer testsuite the ARM assembly fails (board hangs). ATM I have no ability to trap/debug that and cannot use any printf() or the likes to find the test failing, which is irrelevant anyways, because there must be some basic fault in the ASM version, as the C version works as expected.
I don't get it, it's simple enough and very close to the C assembly output (sans branching). I tried the "memory" constraint (shouldn't be needed), I tried saving the carry between lower and upper 64bit in a register and adding that later, using ADD(C).W, alignment, using two LDR/STR instead of LDRD/STRD, etc.. I assume the board faults because some addition goes wrong and results in a divide by 0 or something like that.
The GCC ASM is below and uses similar basic technique, so I don't see the problem.
I'm really just looking for the fastest way to do the add, not to fix that code specifically. It's a shame you have to use constant register names because there is no constraint for specifying rX and rX+1. Also it's impossible to use as many registers as GCC as it will run out of them during compilation.
typedef struct I128 {
int64_t high;
uint64_t low;
} I128;
I128 I128add(I128 a, const I128 b) {
#if defined(USEASM) && defined(ARMx)
__asm(
"LDRD %%r2, %%r3, %[alo]\n"
"LDRD %%r4, %%r5, %[blo]\n"
"ADDS %%r2, %%r2, %%r4\n"
"ADCS %%r3, %%r3, %%r5\n"
"STRD %%r2, %%r3, %[alo]\n"
"LDRD %%r2, %%r3, %[ahi]\n"
"LDRD %%r4, %%r5, %[bhi]\n"
"ADCS %%r2, %%r2, %%r4\n"
"ADC %%r3, %%r3, %%r5\n"
"STRD %%r2, %%r3, %[ahi]\n"
: [alo] "+m" (a.low), [ahi] "+m" (a.high)
: [blo] "m" (b.low), [bhi] "m" (b.high)
: "r2", "r3", "r4", "r5", "cc"
);
return a;
#else
// faster to use temp than saving low and adding to a directly
I128 r = {a.high + b.high, a.low + b.low};
// check for overflow of low 64 bits, add carry to high
// avoid conditionals
//r.high += r.low < a.low || r.low < b.low;
// actually gcc produces faster code with conditionals
if(r.low < a.low || r.low < b.low) ++r.high;
return r;
}
GCC C version using " armv7m-none-eabi-gcc-4.7.2 -O3 -ggdb -fomit-frame-pointer -falign-functions=16 -std=gnu99 -march=armv7e-m":
b082 sub sp, #8
e92d 0ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
a908 add r1, sp, #32
e881 000c stmia.w r1, {r2, r3}
e9dd 890e ldrd r8, r9, [sp, #56] ; 0x38
e9dd 670a ldrd r6, r7, [sp, #40] ; 0x28
e9dd 2308 ldrd r2, r3, [sp, #32]
e9dd 450c ldrd r4, r5, [sp, #48] ; 0x30
eb16 0a08 adds.w sl, r6, r8
eb47 0b09 adc.w fp, r7, r9
1912 adds r2, r2, r4
eb43 0305 adc.w r3, r3, r5
45bb cmp fp, r7
bf08 it eq
45b2 cmpeq sl, r6
d303 bcc.n 8012c9a <I128add+0x3a>
45cb cmp fp, r9
bf08 it eq
45c2 cmpeq sl, r8
d204 bcs.n 8012ca4 <I128add+0x44>
2401 movs r4, #1
2500 movs r5, #0
1912 adds r2, r2, r4
eb43 0305 adc.w r3, r3, r5
e9c0 2300 strd r2, r3, [r0]
e9c0 ab02 strd sl, fp, [r0, #8]
e8bd 0ff0 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
b002 add sp, #8
4770 bx lr
My ASM version that fails:
b082 sub sp, #8
b430 push {r4, r5}
a902 add r1, sp, #8
e881 000c stmia.w r1, {r2, r3}
e9dd 2304 ldrd r2, r3, [sp, #16]
e9dd 4508 ldrd r4, r5, [sp, #32]
1912 adds r2, r2, r4
416b adcs r3, r5
e9cd 2304 strd r2, r3, [sp, #16]
e9dd 2302 ldrd r2, r3, [sp, #8]
e9dd 4506 ldrd r4, r5, [sp, #24]
4162 adcs r2, r4
eb43 0305 adc.w r3, r3, r5
e9cd 2302 strd r2, r3, [sp, #8]
4604 mov r4, r0
c90f ldmia r1, {r0, r1, r2, r3}
e884 000f stmia.w r4, {r0, r1, r2, r3}
4620 mov r0, r4
bc30 pop {r4, r5}
b002 add sp, #8
4770 bx lr

I am not getting a hang from your code, but it isnt working either, not sure why. But it was very easy to patch the compiler generated code to handle the carry:
I128 I128add(I128 a, const I128 b) {
I128 r = {a.high + b.high, a.low + b.low};
return r;
}
becomes
000001e4 <I128add>:
1e4: b082 sub sp, #8
1e6: b4f0 push {r4, r5, r6, r7}
1e8: e9dd 4506 ldrd r4, r5, [sp, #24]
1ec: a904 add r1, sp, #16
1ee: e881 000c stmia.w r1, {r2, r3}
1f2: e9dd 230a ldrd r2, r3, [sp, #40] ; 0x28
1f6: 1912 adds r2, r2, r4
1f8: eb43 0305 adc.w r3, r3, r5
1fc: e9dd 6704 ldrd r6, r7, [sp, #16]
200: e9dd 4508 ldrd r4, r5, [sp, #32]
204: 1936 adds r6, r6, r4
206: eb47 0705 adc.w r7, r7, r5
20a: e9c0 6700 strd r6, r7, [r0]
20e: e9c0 2302 strd r2, r3, [r0, #8]
212: bcf0 pop {r4, r5, r6, r7}
214: b002 add sp, #8
216: 4770 bx lr
fixed up the adds
.thumb_func
.globl test2
test2:
sub sp, #8
push {r4, r5, r6, r7}
ldrd r4, r5, [sp, #24]
add r1, sp, #16
stmia r1, {r2, r3}
ldrd r2, r3, [sp, #40]
add r2, r4
adc r3, r5
ldrd r6, r7, [sp, #16]
ldrd r4, r5, [sp, #32]
adc r6, r4
adc r7, r5
strd r6, r7, [r0]
strd r2, r3, [r0, #8]
pop {r4, r5, r6, r7}
add sp, #8
bx lr
final result
00000024 <test2>:
24: b082 sub sp, #8
26: b4f0 push {r4, r5, r6, r7}
28: e9dd 4506 ldrd r4, r5, [sp, #24]
2c: a904 add r1, sp, #16
2e: c10c stmia r1!, {r2, r3}
30: e9dd 230a ldrd r2, r3, [sp, #40] ; 0x28
34: 1912 adds r2, r2, r4
36: 416b adcs r3, r5
38: e9dd 6704 ldrd r6, r7, [sp, #16]
3c: e9dd 4508 ldrd r4, r5, [sp, #32]
40: 4166 adcs r6, r4
42: 416f adcs r7, r5
44: e9c0 6700 strd r6, r7, [r0]
48: e9c0 2302 strd r2, r3, [r0, #8]
4c: bcf0 pop {r4, r5, r6, r7}
4e: b002 add sp, #8
50: 4770 bx lr
Notice the fewer number of thumb2 instructions, unless you are on a cortex-A that has thumb2 support, those fetches from flash (cortex-m) are (can be) slow. I see you are trying to save the push and pop of two more registers but you cost yourself more fetches. You could take the above and still re-arrange the loads and stores and save those two registers.
minimal testing so far. printfs show the upper words adding, where I didnt see that with your code. I am stilll trying to unwind the calling convention (please document your code more for us), looks like r0 is prepped by the caller to place the result, rest is on the stack. I am using a stellaris launchpad (cortex-m4).

Related

Are there are ARM NEON instructions for signed right-shift that round toward zero?

I am trying to implement an algorithm using ARM intrinsics.
On step of the algorithm requires a right shift of a signed integer, but it needs to round negative values upwards (i.e. less negative). For example, if shifting right by one, then -8 should give -4, but -1 should give 0.
In other words, I want something that rounds negative values towards zero instead of rounding down:
int rightshift(int number, unsigned int shift)
{
return ((number < 0) ? -1 : 1) * (abs(number) >> shift);
}
I am not able to find a suitable function to do this in SIMD fashion. Is there a way to do this in one function call, or some trick that could be used?

I don't think there's a single-instruction shift with round-toward-zero behaviour.
However, you can do it fairly simply with a couple of shift and mask instructions. What we need to do is to add one to the result if we started with a negative number and there is a 'carry' out (i.e. any bit to the right of the result would have been 1).
I can demonstrate this with the following pure C code:
#include <stddef.h>
#include <limits.h>
int16_t rightshift(int number, unsigned int shift)
{
static const size_t bits = sizeof number * CHAR_BIT;
number += ((1<<shift) - 1) & (number >> bits-1);
return number >> shift;
}
#include <stdio.h>
int main() {
for (int i = -16; i <= 16; ++i) {
printf(" %3d: ", i);
for (int j = 0; j < 4; ++j)
printf("%4d", rightshift(i, j));
puts("");
}
}
This compiles to some nice branch-fee assembly, which looks amenable to inlining (especially when shift is a compile-time constant):
rightshift:
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
movs r3, #1
lsls r3, r3, r1
subs r3, r3, #1
and r3, r3, r0, asr #31
add r0, r0, r3
asrs r0, r0, r1
bx lr
To target Neon, I wrote another function, to exercise it with multiple data:
void do_shift(int16_t *restrict dest, const int16_t *restrict src,
size_t count, unsigned int shift)
{
for (size_t j = 0; j < count; ++j) {
dest[j] = rightshift(src[j], shift);
}
}
And a test program for it:
#include <stdio.h>
int main() {
static const int16_t src[] = {
-32768, -32767, -32766, -32765, -32764,
-16384, -16383, -16382, -16381, -16380,
-8193, -8192, -8191, -8190, -8189,
-16, -15, -14, -13, -12, -10, -9,
-8, -7, -6, -5, -4, -3, -2, -1, 0,
1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16,
1023, 1024, 32767,
};
static const size_t count = sizeof src / sizeof *src;
int16_t dest[16][count];
for (unsigned int i = 0; i < 16; ++i) {
do_shift(dest[i], src, count, i);
}
for (size_t i = 0; i < count; ++i) {
printf("%7d: ", src[i]);
for (int j = 0; j < 16; ++j)
printf("%7d", dest[j][i]);
puts("");
}
}
I compiled this with gcc -O3 -march=armv7 -mfpu=neon. I confess I'm not familiar with the Neon instructions, but the results may be instructive:
do_shift:
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
cmp r2, #0
beq .L21
push {r4, r5, r6, r7, r8, lr}
ubfx r4, r1, #1, #2
negs r4, r4
movs r5, #1
and r4, r4, #7
lsls r5, r5, r3
adds r7, r4, #7
subs r6, r2, #1
subs r5, r5, #1
cmp r6, r7
sxth r5, r5
bcc .L8
cmp r4, #0
beq .L9
ldrsh r7, [r1]
cmp r4, #1
and r6, r5, r7, asr #31
add r6, r6, r7
sxth r6, r6
asr r6, r6, r3
strh r6, [r0] # movhi
beq .L9
ldrsh r7, [r1, #2]
cmp r4, #2
and r6, r5, r7, asr #31
add r6, r6, r7
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, #2] # movhi
beq .L9
ldrsh r7, [r1, #4]
cmp r4, #3
and r6, r5, r7, asr #31
add r6, r6, r7
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, #4] # movhi
beq .L9
ldrsh r7, [r1, #6]
cmp r4, #4
and r6, r5, r7, asr #31
add r6, r6, r7
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, #6] # movhi
beq .L9
ldrsh r7, [r1, #8]
cmp r4, #5
and r6, r5, r7, asr #31
add r6, r6, r7
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, #8] # movhi
beq .L9
ldrsh r7, [r1, #10]
cmp r4, #7
ite eq
moveq r8, r4
movne r8, #6
and r6, r5, r7, asr #31
add r6, r6, r7
it eq
ldrsheq r7, [r1, #12]
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, #10] # movhi
itttt eq
andeq r6, r5, r7, asr #31
addeq r6, r6, r7
sxtheq r6, r6
asreq r6, r6, r3
it eq
strheq r6, [r0, #12] # movhi
.L4:
vdup.32 q10, r3
sub lr, r2, r4
lsls r4, r4, #1
movs r7, #0
vneg.s32 q10, q10
adds r6, r1, r4
lsr ip, lr, #3
add r4, r4, r0
vdup.16 q12, r5
.L6:
adds r7, r7, #1
adds r6, r6, #16
vldr d18, [r6, #-16]
vldr d19, [r6, #-8]
cmp r7, ip
vshr.s16 q8, q9, #15
vand q8, q8, q12
vadd.i16 q8, q8, q9
vmovl.s16 q9, d16
vmovl.s16 q8, d17
vshl.s32 q9, q9, q10
vshl.s32 q8, q8, q10
vmovn.i32 d22, q9
vmovn.i32 d23, q8
vst1.16 {q11}, [r4]
add r4, r4, #16
bcc .L6
bic r6, lr, #7
cmp lr, r6
add r4, r8, r6
beq .L1
.L3:
ldrsh ip, [r1, r4, lsl #1]
adds r7, r4, #1
cmp r2, r7
and r6, r5, ip, asr #31
add r6, r6, ip
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, r4, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, r7, lsl #1]
add ip, r4, #2
cmp r2, ip
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, r7, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, ip, lsl #1]
adds r7, r4, #3
cmp r2, r7
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, ip, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, r7, lsl #1]
add ip, r4, #4
cmp r2, ip
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, r7, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, ip, lsl #1]
adds r7, r4, #5
cmp r2, r7
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, ip, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, r7, lsl #1]
add ip, r4, #6
cmp r2, ip
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, r7, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, ip, lsl #1]
adds r7, r4, #7
cmp r2, r7
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, ip, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, r7, lsl #1]
add ip, r4, #8
cmp r2, ip
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, r7, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, ip, lsl #1]
add r7, r4, #9
cmp r2, r7
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, ip, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, r7, lsl #1]
add ip, r4, #10
cmp r2, ip
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, r7, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, ip, lsl #1]
add r7, r4, #11
cmp r2, r7
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, ip, lsl #1] # movhi
bls .L1
ldrsh lr, [r1, r7, lsl #1]
add ip, r4, #12
cmp r2, ip
and r6, r5, lr, asr #31
add r6, r6, lr
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, r7, lsl #1] # movhi
bls .L1
ldrsh r7, [r1, ip, lsl #1]
adds r4, r4, #13
cmp r2, r4
and r6, r5, r7, asr #31
add r6, r6, r7
sxth r6, r6
asr r6, r6, r3
strh r6, [r0, ip, lsl #1] # movhi
bls .L1
ldrsh r1, [r1, r4, lsl #1]
and r2, r5, r1, asr #31
add r2, r2, r1
sxth r2, r2
asr r3, r2, r3
strh r3, [r0, r4, lsl #1] # movhi
.L1:
pop {r4, r5, r6, r7, r8, pc}
.L9:
mov r8, r4
b .L4
.L21:
bx lr
.L8:
movs r4, #0
b .L3
There's a lot of loop unrolling which makes the code longer, but the pattern should be clear.

Replacing memcpy and .divsi3_skip_div0_test with smaller code on ARM MCU

My entry for https://hackaday.com/2016/11/21/step-up-to-the-1-kb-challenge/ includes a couple of huge functions which are not generated from any C code which I have written.
000004e4 <.divsi3_skip_div0_test>:
4e4: b410 push {r4}
4e6: 1c04 adds r4, r0, #0
4e8: 404c eors r4, r1
4ea: 46a4 mov ip, r4
4ec: 2301 movs r3, #1
4ee: 2200 movs r2, #0
4f0: 2900 cmp r1, #0
4f2: d500 bpl.n 4f6 <.divsi3_skip_div0_test+0x12>
4f4: 4249 negs r1, r1
4f6: 2800 cmp r0, #0
4f8: d500 bpl.n 4fc <.divsi3_skip_div0_test+0x18>
4fa: 4240 negs r0, r0
4fc: 4288 cmp r0, r1
4fe: d32c bcc.n 55a <.divsi3_skip_div0_test+0x76>
500: 2401 movs r4, #1
502: 0724 lsls r4, r4, #28
504: 42a1 cmp r1, r4
506: d204 bcs.n 512 <.divsi3_skip_div0_test+0x2e>
508: 4281 cmp r1, r0
50a: d202 bcs.n 512 <.divsi3_skip_div0_test+0x2e>
50c: 0109 lsls r1, r1, #4
50e: 011b lsls r3, r3, #4
510: e7f8 b.n 504 <.divsi3_skip_div0_test+0x20>
512: 00e4 lsls r4, r4, #3
514: 42a1 cmp r1, r4
516: d204 bcs.n 522 <.divsi3_skip_div0_test+0x3e>
518: 4281 cmp r1, r0
51a: d202 bcs.n 522 <.divsi3_skip_div0_test+0x3e>
51c: 0049 lsls r1, r1, #1
51e: 005b lsls r3, r3, #1
520: e7f8 b.n 514 <.divsi3_skip_div0_test+0x30>
522: 4288 cmp r0, r1
524: d301 bcc.n 52a <.divsi3_skip_div0_test+0x46>
526: 1a40 subs r0, r0, r1
528: 431a orrs r2, r3
52a: 084c lsrs r4, r1, #1
52c: 42a0 cmp r0, r4
52e: d302 bcc.n 536 <.divsi3_skip_div0_test+0x52>
530: 1b00 subs r0, r0, r4
532: 085c lsrs r4, r3, #1
534: 4322 orrs r2, r4
536: 088c lsrs r4, r1, #2
538: 42a0 cmp r0, r4
53a: d302 bcc.n 542 <.divsi3_skip_div0_test+0x5e>
53c: 1b00 subs r0, r0, r4
53e: 089c lsrs r4, r3, #2
540: 4322 orrs r2, r4
542: 08cc lsrs r4, r1, #3
544: 42a0 cmp r0, r4
546: d302 bcc.n 54e <.divsi3_skip_div0_test+0x6a>
548: 1b00 subs r0, r0, r4
54a: 08dc lsrs r4, r3, #3
54c: 4322 orrs r2, r4
54e: 2800 cmp r0, #0
550: d003 beq.n 55a <.divsi3_skip_div0_test+0x76>
552: 091b lsrs r3, r3, #4
554: d001 beq.n 55a <.divsi3_skip_div0_test+0x76>
556: 0909 lsrs r1, r1, #4
558: e7e3 b.n 522 <.divsi3_skip_div0_test+0x3e>
55a: 1c10 adds r0, r2, #0
55c: 4664 mov r4, ip
55e: 2c00 cmp r4, #0
560: d500 bpl.n 564 <.divsi3_skip_div0_test+0x80>
562: 4240 negs r0, r0
564: bc10 pop {r4}
566: 4770 bx lr
568: 2800 cmp r0, #0
56a: d006 beq.n 57a <.divsi3_skip_div0_test+0x96>
56c: db03 blt.n 576 <.divsi3_skip_div0_test+0x92>
56e: 2000 movs r0, #0
570: 43c0 mvns r0, r0
572: 0840 lsrs r0, r0, #1
574: e001 b.n 57a <.divsi3_skip_div0_test+0x96>
576: 2080 movs r0, #128 ; 0x80
578: 0600 lsls r0, r0, #24
57a: b407 push {r0, r1, r2}
57c: 4802 ldr r0, [pc, #8] ; (588 <.divsi3_skip_div0_test+0xa4>)
57e: a102 add r1, pc, #8 ; (adr r1, 588 <.divsi3_skip_div0_test+0xa4>)
580: 1840 adds r0, r0, r1
582: 9002 str r0, [sp, #8]
584: bd03 pop {r0, r1, pc}
586: 46c0 nop ; (mov r8, r8)
588: 00000019 .word 0x00000019
and:
000005a4 <memcpy>:
5a4: b5f0 push {r4, r5, r6, r7, lr}
5a6: 2a0f cmp r2, #15
5a8: d935 bls.n 616 <memcpy+0x72>
5aa: 1c03 adds r3, r0, #0
5ac: 430b orrs r3, r1
5ae: 079c lsls r4, r3, #30
5b0: d135 bne.n 61e <memcpy+0x7a>
5b2: 1c16 adds r6, r2, #0
5b4: 3e10 subs r6, #16
5b6: 0936 lsrs r6, r6, #4
5b8: 0135 lsls r5, r6, #4
5ba: 1945 adds r5, r0, r5
5bc: 3510 adds r5, #16
5be: 1c0c adds r4, r1, #0
5c0: 1c03 adds r3, r0, #0
5c2: 6827 ldr r7, [r4, #0]
5c4: 601f str r7, [r3, #0]
5c6: 6867 ldr r7, [r4, #4]
5c8: 605f str r7, [r3, #4]
5ca: 68a7 ldr r7, [r4, #8]
5cc: 609f str r7, [r3, #8]
5ce: 68e7 ldr r7, [r4, #12]
5d0: 3410 adds r4, #16
5d2: 60df str r7, [r3, #12]
5d4: 3310 adds r3, #16
5d6: 42ab cmp r3, r5
5d8: d1f3 bne.n 5c2 <memcpy+0x1e>
5da: 1c73 adds r3, r6, #1
5dc: 011b lsls r3, r3, #4
5de: 18c5 adds r5, r0, r3
5e0: 18c9 adds r1, r1, r3
5e2: 230f movs r3, #15
5e4: 4013 ands r3, r2
5e6: 2b03 cmp r3, #3
5e8: d91b bls.n 622 <memcpy+0x7e>
5ea: 1f1c subs r4, r3, #4
5ec: 08a4 lsrs r4, r4, #2
5ee: 3401 adds r4, #1
5f0: 00a4 lsls r4, r4, #2
5f2: 2300 movs r3, #0
5f4: 58ce ldr r6, [r1, r3]
5f6: 50ee str r6, [r5, r3]
5f8: 3304 adds r3, #4
5fa: 42a3 cmp r3, r4
5fc: d1fa bne.n 5f4 <memcpy+0x50>
5fe: 18ed adds r5, r5, r3
600: 18c9 adds r1, r1, r3
602: 2303 movs r3, #3
604: 401a ands r2, r3
606: d005 beq.n 614 <memcpy+0x70>
608: 2300 movs r3, #0
60a: 5ccc ldrb r4, [r1, r3]
60c: 54ec strb r4, [r5, r3]
60e: 3301 adds r3, #1
610: 4293 cmp r3, r2
612: d1fa bne.n 60a <memcpy+0x66>
614: bdf0 pop {r4, r5, r6, r7, pc}
616: 1c05 adds r5, r0, #0
618: 2a00 cmp r2, #0
61a: d1f5 bne.n 608 <memcpy+0x64>
61c: e7fa b.n 614 <memcpy+0x70>
61e: 1c05 adds r5, r0, #0
620: e7f2 b.n 608 <memcpy+0x64>
622: 1c1a adds r2, r3, #0
624: e7f8 b.n 618 <memcpy+0x74>
626: 46c0 nop ; (mov r8, r8)
I am guessing that I could code far smaller, but less time-efficient, replacements myself.
Is this likely?
Where will I find the source which I need to edit - I'm guessing that I should look for the source of a libc under gcc-arm-none-eabi/lib/gcc/arm-none-eabi/4.8.3/. I think that I've found the compiled symbols, but I can't find the source.
~/gcc-arm-none-eabi$ grep -R divsi3_skip_div0_test *
Binary file lib/gcc/arm-none-eabi/4.8.3/libgcc.a matches
Binary file lib/gcc/arm-none-eabi/4.8.3/thumb/libgcc.a matches
Binary file lib/gcc/arm-none-eabi/4.8.3/armv6-m/libgcc.a matches
Binary file lib/gcc/arm-none-eabi/4.8.3/fpu/libgcc.a matches
Binary file lib/gcc/arm-none-eabi/4.8.3/armv7-ar/thumb/libgcc.a matches
Binary file lib/gcc/arm-none-eabi/4.8.3/armv7-ar/thumb/softfp/libgcc.a matches
Binary file lib/gcc/arm-none-eabi/4.8.3/armv7-ar/thumb/fpu/libgcc.a matches
Alternatively, is there a way to tell gcc not to use memcpy when copying structures? (They are 10 bytes, so three thumb instructions should do the job.) I've tried adding -mno-memcpy and -Wa,mno-memcpy but neither are recognised.
Update:
I've solved the memcpy part of this question - adding a partial, but sufficient, memcpy function stops the other from being added.
size_t memcpy(uint8_t *restrict dst, uint8_t *restrict const src, size_t size) {
int i;
for (i = 0; i < size; i++) {
dst[i] = src[i];
}
return i;
}
It's much smaller, but less efficient and won't handle dst < src + size overlap.
000003ec <memcpy>:
3ec: b510 push {r4, lr}
3ee: 2300 movs r3, #0
3f0: 4293 cmp r3, r2
3f2: d003 beq.n 3fc <memcpy+0x10>
3f4: 5ccc ldrb r4, [r1, r3]
3f6: 54c4 strb r4, [r0, r3]
3f8: 3301 adds r3, #1
3fa: e7f9 b.n 3f0 <memcpy+0x4>
3fc: 1c18 adds r0, r3, #0
3fe: bd10 pop {r4, pc}
To clarify, I'm now only asking what I might do to replace the .divsi3_skip_div0_test code with a less efficient, but smaller, code.
It is not clear to me where the source of this code is, or how to edit its source. It looks to be more complicated to replace than memcpy as it does not look like a C function, as its name begins with a ..

Why does direct accessing to structure members produces significantly more assembly code compared to indirect accessing in GCC?

There are many topics here whether a direct access or an indirect access (via pointer) are faster when accessing structure members, in C.
One example: C pointers vs direct member access for structs
The general opinion is direct access will be faster (at least theoretically) since pointer dereferencing is not used.
So I gave it a try with a chunk of code in my system: GNU Embedded Tools GCC 4.7.4, generating code for ARM (actually ARM-Cortex-A15).
Surprisingly, direct access was much slower. Then I generated the assembly codes for the object file.
Direct access code has 114 lines of assembly code, and indirect access code has 33 lines of assembly code. What is going on here?
Below are the C code and generated assembly code of the functions. The structures are all map to external memory and the structure members are all one-byte word long (unsigned char type).
First function, with indirect access:
void sub_func_1(unsigned int num_of, struct file_s *__restrict__ first_file_ptr, struct file_s *__restrict__ second_file_ptr, struct output_s *__restrict__ output_ptr)
{
if(LIKELY(num_of == 0))
{
output_ptr->curr_id = UNUSED;
output_ptr->curr_cnt = output_ptr->cnt;
output_ptr->curr_mode = output_ptr->_mode;
output_ptr->curr_type = output_ptr->type;
output_ptr->curr_size = output_ptr->size;
output_ptr->curr_allocation_type = output_ptr->allocation_type;
output_ptr->curr_allocation_localized = output_ptr->allocation_localized;
output_ptr->curr_mode_enable = output_ptr->mode_enable;
if(output_ptr->curr_cnt == 1)
{
first_file_ptr->status = BLOCK_IDLE;
first_file_ptr->type = USER_DATA_TYPE;
first_file_ptr->index = FIRST__WORD;
first_file_ptr->layer_cnt = output_ptr->layer_cnt;
second_file_ptr->status = DISABLED;
second_file_ptr->index = 0;
second_file_ptr->redundancy_version = 1;
output_ptr->total_layer_cnt = first_file_ptr->layer_cnt;
}
}
}
00000000 <sub_func_1>:
0: e3500000 cmp r0, #0
4: e92d01f0 push {r4, r5, r6, r7, r8}
8: 1a00001b bne 7c <sub_func_1+0x7c>
c: e5d34007 ldrb r4, [r3, #7]
10: e3a05008 mov r5, #8
14: e5d3c003 ldrb ip, [r3, #3]
18: e5d38014 ldrb r8, [r3, #20]
1c: e5c35001 strb r5, [r3, #1]
20: e5d37015 ldrb r7, [r3, #21]
24: e5d36018 ldrb r6, [r3, #24]
28: e5c34008 strb r4, [r3, #8]
2c: e5d35019 ldrb r5, [r3, #25]
30: e35c0001 cmp ip, #1
34: e5c3c005 strb ip, [r3, #5]
38: e5d34012 ldrb r4, [r3, #18]
3c: e5c38010 strb r8, [r3, #16]
40: e5c37011 strb r7, [r3, #17]
44: e5c3601a strb r6, [r3, #26]
48: e5c3501b strb r5, [r3, #27]
4c: e5c34013 strb r4, [r3, #19]
50: 1a000009 bne 7c <sub_func_1+0x7c>
54: e5d3400b ldrb r4, [r3, #11]
58: e3a05005 mov r5, #5
5c: e5c1c000 strb ip, [r1]
60: e5c10002 strb r0, [r1, #2]
64: e5c15001 strb r5, [r1, #1]
68: e5c20000 strb r0, [r2]
6c: e5c14003 strb r4, [r1, #3]
70: e5c20005 strb r0, [r2, #5]
74: e5c2c014 strb ip, [r2, #20]
78: e5c3400f strb r4, [r3, #15]
7c: e8bd01f0 pop {r4, r5, r6, r7, r8}
80: e12fff1e bx lr
Second function, with direct access:
void sub_func_2(unsigned int output_index, unsigned int cc_index, unsigned int num_of)
{
if(LIKELY(num_of == 0))
{
output_file[output_index].curr_id = UNUSED;
output_file[output_index].curr_cnt = output_file[output_index].cnt;
output_file[output_index].curr_mode = output_file[output_index]._mode;
output_file[output_index].curr_type = output_file[output_index].type;
output_file[output_index].curr_size = output_file[output_index].size;
output_file[output_index].curr_allocation_type = output_file[output_index].allocation_type;
output_file[output_index].curr_allocation_localized = output_file[output_index].allocation_localized;
output_file[output_index].curr_mode_enable = output_file[output_index].mode_enable;
if(output_file[output_index].curr_cnt == 1)
{
output_file[output_index].cc_file[cc_index].file[0].status = BLOCK_IDLE;
output_file[output_index].cc_file[cc_index].file[0].type = USER_DATA_TYPE;
output_file[output_index].cc_file[cc_index].file[0].index = FIRST__WORD;
output_file[output_index].cc_file[cc_index].file[0].layer_cnt = output_file[output_index].layer_cnt;
output_file[output_index].cc_file[cc_index].file[1].status = DISABLED;
output_file[output_index].cc_file[cc_index].file[1].index = 0;
output_file[output_index].cc_file[cc_index].file[1].redundancy_version = 1;
output_file[output_index].total_layer_cnt = output_file[output_index].cc_file[cc_index].file[0].layer_cnt;
}
}
}
00000084 <sub_func_2>:
84: e92d0ff0 push {r4, r5, r6, r7, r8, r9, sl, fp}
88: e3520000 cmp r2, #0
8c: e24dd018 sub sp, sp, #24
90: e58d2004 str r2, [sp, #4]
94: 1a000069 bne 240 <sub_func_2+0x1bc>
98: e3a03d61 mov r3, #6208 ; 0x1840
9c: e30dc0c0 movw ip, #53440 ; 0xd0c0
a0: e340c001 movt ip, #1
a4: e3002000 movw r2, #0
a8: e0010193 mul r1, r3, r1
ac: e3402000 movt r2, #0
b0: e3067490 movw r7, #25744 ; 0x6490
b4: e3068488 movw r8, #25736 ; 0x6488
b8: e3a0b008 mov fp, #8
bc: e3066498 movw r6, #25752 ; 0x6498
c0: e02c109c mla ip, ip, r0, r1
c4: e082c00c add ip, r2, ip
c8: e28c3b19 add r3, ip, #25600 ; 0x6400
cc: e08c4007 add r4, ip, r7
d0: e5d39083 ldrb r9, [r3, #131] ; 0x83
d4: e08c5006 add r5, ip, r6
d8: e5d3a087 ldrb sl, [r3, #135] ; 0x87
dc: e5c3b081 strb fp, [r3, #129] ; 0x81
e0: e5c39085 strb r9, [r3, #133] ; 0x85
e4: e2833080 add r3, r3, #128 ; 0x80
e8: e7cca008 strb sl, [ip, r8]
ec: e5d4a004 ldrb sl, [r4, #4]
f0: e7cca007 strb sl, [ip, r7]
f4: e5d47005 ldrb r7, [r4, #5]
f8: e5c47001 strb r7, [r4, #1]
fc: e7dc6006 ldrb r6, [ip, r6]
100: e5d5c001 ldrb ip, [r5, #1]
104: e5c56002 strb r6, [r5, #2]
108: e5c5c003 strb ip, [r5, #3]
10c: e5d4c002 ldrb ip, [r4, #2]
110: e5c4c003 strb ip, [r4, #3]
114: e5d33005 ldrb r3, [r3, #5]
118: e3530001 cmp r3, #1
11c: 1a000047 bne 240 <sub_func_2+0x1bc>
120: e30dc0c0 movw ip, #53440 ; 0xd0c0
124: e30db0c0 movw fp, #53440 ; 0xd0c0
128: e1a0700c mov r7, ip
12c: e7dfc813 bfi ip, r3, #16, #16
130: e1a05007 mov r5, r7
134: e1a0900b mov r9, fp
138: e02c109c mla ip, ip, r0, r1
13c: e1a04005 mov r4, r5
140: e1a0a00b mov sl, fp
144: e7df9813 bfi r9, r3, #16, #16
148: e7dfb813 bfi fp, r3, #16, #16
14c: e1a06007 mov r6, r7
150: e7dfa813 bfi sl, r3, #16, #16
154: e58dc008 str ip, [sp, #8]
158: e7df6813 bfi r6, r3, #16, #16
15c: e1a0c004 mov ip, r4
160: e7df4813 bfi r4, r3, #16, #16
164: e02b109b mla fp, fp, r0, r1
168: e7df5813 bfi r5, r3, #16, #16
16c: e0291099 mla r9, r9, r0, r1
170: e7df7813 bfi r7, r3, #16, #16
174: e7dfc813 bfi ip, r3, #16, #16
178: e0261096 mla r6, r6, r0, r1
17c: e0241094 mla r4, r4, r0, r1
180: e082b00b add fp, r2, fp
184: e0829009 add r9, r2, r9
188: e02a109a mla sl, sl, r0, r1
18c: e28bbc65 add fp, fp, #25856 ; 0x6500
190: e58d600c str r6, [sp, #12]
194: e2899c65 add r9, r9, #25856 ; 0x6500
198: e3a06005 mov r6, #5
19c: e58d4010 str r4, [sp, #16]
1a0: e59d4008 ldr r4, [sp, #8]
1a4: e0251095 mla r5, r5, r0, r1
1a8: e5cb3000 strb r3, [fp]
1ac: e082a00a add sl, r2, sl
1b0: e59db00c ldr fp, [sp, #12]
1b4: e5c96001 strb r6, [r9, #1]
1b8: e59d6004 ldr r6, [sp, #4]
1bc: e28aac65 add sl, sl, #25856 ; 0x6500
1c0: e58d5014 str r5, [sp, #20]
1c4: e0271097 mla r7, r7, r0, r1
1c8: e0825004 add r5, r2, r4
1cc: e30d40c0 movw r4, #53440 ; 0xd0c0
1d0: e02c109c mla ip, ip, r0, r1
1d4: e0855008 add r5, r5, r8
1d8: e7df4813 bfi r4, r3, #16, #16
1dc: e5ca6002 strb r6, [sl, #2]
1e0: e5d59003 ldrb r9, [r5, #3]
1e4: e082600b add r6, r2, fp
1e8: e59db014 ldr fp, [sp, #20]
1ec: e0201094 mla r0, r4, r0, r1
1f0: e2866c65 add r6, r6, #25856 ; 0x6500
1f4: e59d1010 ldr r1, [sp, #16]
1f8: e306a53c movw sl, #25916 ; 0x653c
1fc: e0827007 add r7, r2, r7
200: e2877c65 add r7, r7, #25856 ; 0x6500
204: e082c00c add ip, r2, ip
208: e5c69003 strb r9, [r6, #3]
20c: e59d6004 ldr r6, [sp, #4]
210: e28ccc65 add ip, ip, #25856 ; 0x6500
214: e082500b add r5, r2, fp
218: e0820000 add r0, r2, r0
21c: e0824001 add r4, r2, r1
220: e085500a add r5, r5, sl
224: e0808008 add r8, r0, r8
228: e7c4600a strb r6, [r4, sl]
22c: e5c56005 strb r6, [r5, #5]
230: e5c73050 strb r3, [r7, #80] ; 0x50
234: e5dc3003 ldrb r3, [ip, #3]
238: e287704c add r7, r7, #76 ; 0x4c
23c: e5c83007 strb r3, [r8, #7]
240: e28dd018 add sp, sp, #24
244: e8bd0ff0 pop {r4, r5, r6, r7, r8, r9, sl, fp}
248: e12fff1e bx lr
And last part, my compile options are:
# Compile options.
C_OPTS = -Wall \
-std=gnu99 \
-fgnu89-inline \
-Wcast-align \
-Werror=uninitialized \
-Werror=maybe-uninitialized \
-Werror=overflow \
-mcpu=cortex-a15 \
-mtune=cortex-a15 \
-mabi=aapcs \
-mfpu=neon \
-ftree-vectorize \
-ftree-slp-vectorize \
-ftree-vectorizer-verbose=4 \
-mfloat-abi=hard \
-O3 \
-flto \
-marm \
-ffat-lto-objects \
-fno-gcse \
-fno-strict-aliasing \
-fno-delete-null-pointer-checks \
-fno-strict-overflow \
-fuse-linker-plugin \
-falign-functions=4 \
-falign-loops=4 \
-falign-labels=4 \
-falign-jumps=4
Update:
Note: I deleted the structure definitions because there was differences with the version of my own program. It is actually a huge structure and it is not efficient to put here completely.
As suggested, I get rid of -fno-gcse, and the generated asm is not huge as before.
Without -fno-gcse, sub_func_1 generates the same code as above.
For sub_func_2:
00000084 <sub_func_2>:
84: e3520000 cmp r2, #0
88: e92d0070 push {r4, r5, r6}
8c: 1a000030 bne 154 <sub_func_2+0xd0>
90: e30d30c0 movw r3, #53440 ; 0xd0c0
94: e3a06008 mov r6, #8
98: e3403001 movt r3, #1
9c: e0030093 mul r3, r3, r0
a0: e3a00d61 mov r0, #6208 ; 0x1840
a4: e0213190 mla r1, r0, r1, r3
a8: e59f30ac ldr r3, [pc, #172] ; 15c <sub_func_2+0xd8>
ac: e0831001 add r1, r3, r1
b0: e2813b19 add r3, r1, #25600 ; 0x6400
b4: e5d34083 ldrb r4, [r3, #131] ; 0x83
b8: e1a00003 mov r0, r3
bc: e5d35087 ldrb r5, [r3, #135] ; 0x87
c0: e5c36081 strb r6, [r3, #129] ; 0x81
c4: e5c34085 strb r4, [r3, #133] ; 0x85
c8: e3064488 movw r4, #25736 ; 0x6488
cc: e2833080 add r3, r3, #128 ; 0x80
d0: e7c15004 strb r5, [r1, r4]
d4: e5d05094 ldrb r5, [r0, #148] ; 0x94
d8: e0844006 add r4, r4, r6
dc: e7c15004 strb r5, [r1, r4]
e0: e5d04095 ldrb r4, [r0, #149] ; 0x95
e4: e5d0c092 ldrb ip, [r0, #146] ; 0x92
e8: e5c04091 strb r4, [r0, #145] ; 0x91
ec: e3064498 movw r4, #25752 ; 0x6498
f0: e7d15004 ldrb r5, [r1, r4]
f4: e5c0c093 strb ip, [r0, #147] ; 0x93
f8: e5d04099 ldrb r4, [r0, #153] ; 0x99
fc: e5c0509a strb r5, [r0, #154] ; 0x9a
100: e5c0409b strb r4, [r0, #155] ; 0x9b
104: e5d33005 ldrb r3, [r3, #5]
108: e3530001 cmp r3, #1
10c: 1a000010 bne 154 <sub_func_2+0xd0>
110: e281cc65 add ip, r1, #25856 ; 0x6500
114: e3a06005 mov r6, #5
118: e2810b19 add r0, r1, #25600 ; 0x6400
11c: e1a0500c mov r5, ip
120: e5cc3000 strb r3, [ip]
124: e1a0400c mov r4, ip
128: e5cc6001 strb r6, [ip, #1]
12c: e5cc2002 strb r2, [ip, #2]
130: e5d0608b ldrb r6, [r0, #139] ; 0x8b
134: e5cc6003 strb r6, [ip, #3]
138: e306c53c movw ip, #25916 ; 0x653c
13c: e7c1200c strb r2, [r1, ip]
140: e5c52041 strb r2, [r5, #65] ; 0x41
144: e285503c add r5, r5, #60 ; 0x3c
148: e5c43050 strb r3, [r4, #80] ; 0x50
14c: e284404c add r4, r4, #76 ; 0x4c
150: e5c0608f strb r6, [r0, #143] ; 0x8f
154: e8bd0070 pop {r4, r5, r6}
158: e12fff1e bx lr
15c: 00000000 .word 0x00000000

TL:DR: can't reproduce that insane compiler output. Maybe the surrounding code + LTO did it?
I do have suggestions to improve the code: see the stuff below about copying whole structs instead of copying many individual members.
The question you linked is about accessing a value-type global directly vs. through a global pointer. On ARM, where it takes multiple instructions or a load from a nearby constant to get an arbitrary 32bit pointer into a register, passing around pointers is better than having each function reference a global directly.
See this example on the Godbolt Compiler Explorer (ARM gcc 4.8.2 -O3)
struct example {
int a, b, c;
} global_example;
int load_global(void) { return global_example.c; }
movw r3, #:lower16:global_example # tmp113,
movt r3, #:upper16:global_example # tmp113,
ldr r0, [r3, #8] #, global_example.c
bx lr #
int load_pointer(struct example *p) { return p->c; }
ldr r0, [r0, #8] #, p_2(D)->c
bx lr #
(Apparently gcc is horrible at passing structs by val as function args, see the code for byval(struct example by_val) on the godbolt link.)
Even worse is if you have a global pointer: first you have to load the value of the pointer, then another load to dereference it. This is the indirection overhead that was being discussed in the question you linked. If both loads miss in cache, you're paying the round-trip latency twice. The load address for the 2nd load isn't available until the first load completes, so no pipelining of those memory requests is possible even on an out-of-order CPU.
If you already have a pointer as an arg, it will be in a register. Dereferencing it is the same as loading from a global. (But better, because you don't need to get the global's address into a register yourself.)
Your real code
I can't reproduce your massive asm output with ARM gcc 4.8.2 on Godbolt, or locally with ARM gcc 5.2.1. I'm not using LTO, though, since I don't have a complete test program.
All I can see is just slightly larger code to do some index math.
bfi is Bitfield Insert. I think 144: e7df9813 bfi r9, r3, #16, #16 is setting the top half of r9 = low half of r3. I don't see how that and mla (integer mul-accumulate) make much sense. Other than perverse results from -ftree-vectorize, all I can think of is maybe -fno-gcse has a really bad impact for the version of gcc you tested.
Is it manipulating constants that are going to be stored? The code you actually posted #defines everything to 0, which gcc takes advantage of. (It also takes advantage of the fact that it already has 1 in a register if curr_cnt == 1, and stores that register for the second_file_ptr->redundancy_version = 1;). ARM doesn't have a str [mem], immediate or anything like x86's mov [mem], imm.
If your compiler output is from code with different values for those constants, the compiler would be doing more work to store different things.
Unfortunately gcc is bad at merging narrow stores into a single wider store (long-standing missed-optimization bug). For x86, clang does this in at least one case, storing 0x0100 (256) instead of a 0 and a 1. (check on godbolt by flipping the compiler to clang 3.7.1 or something, and removing the ARM-specific compiler args. There's a mov word ptr \[rsi\], 256 where gcc uses
mov BYTE PTR [rsi], 0 # *first_file_ptr_23(D).status,
mov BYTE PTR [rsi+1], 1 # *first_file_ptr_23(D).type,
If you arranged your structs carefully, there would be more opportunities for copying 4B blocks in this function.
It might also help to have two identical sub-structs of curr and not-curr, instead of curr_size and size. You might have to declare it packed to avoid padding after the sub-structs, though. Your two groups of members aren't in exactly the same order, which prevents compilers from doing much block-copying anyway when you do a bunch of assignments.
It helps gcc and clang copy multiple bytes at once if you do:
struct output_s_optimized {
struct __attribute__((packed)) stuff {
unsigned char cnt,
mode,
type,
size,
allocation_type,
allocation_localized,
mode_enable;
} curr; // 7B
unsigned char curr_id; // no non-curr id?
struct stuff non_curr;
unsigned char layer_cnt;
// Another 8 byte boundary here
unsigned char total_layer_cnt;
struct cc_file_s cc_file[128];
};
void foo(struct output_s_optimized *p) {
p->curr_id = 0;
p->non_curr = p->curr;
}
void bar(struct output_s_optimized *output_ptr) {
output_ptr->curr_id = 0;
output_ptr->curr.cnt = output_ptr->non_curr.cnt;
output_ptr->curr.mode = output_ptr->non_curr.mode;
output_ptr->curr.type = output_ptr->non_curr.type;
output_ptr->curr.size = output_ptr->non_curr.size;
output_ptr->curr.allocation_type = output_ptr->non_curr.allocation_type;
output_ptr->curr.allocation_localized = output_ptr->non_curr.allocation_localized;
output_ptr->curr.mode_enable = output_ptr->non_curr.mode_enable;
}
gcc 4.8.2 compiles foo() to three copies: byte, 2B, and 4B, even on ARM. It compiles bar() to eight 1B copies, and so does clang-3.8 on x86. So copying whole structs can help your compiler a lot (as well as making sure the data to be copied is arranged in the same order in both locations).
the same code on x86: nothing new
You can use -fverbose-asm to put comments on each line. For x86, the compiler output from gcc 6.1 -O3 is very similar between versions, as you can see on the Godbolt Compiler Explorer. x86 addressing modes can index a global variable directly, so you see stuff like
movzx edi, BYTE PTR [rcx+10] # *output_ptr_7(D)._mode
# where rcx is the output_ptr arg, used directly
vs.
movzx ecx, BYTE PTR output_file[rdi+10] # output_file[output_index_7(D)]._mode
# where rdi = output_index * 1297 (sizeof(output_file[0])), calculated once at the start
(gcc apparently doesn't care that each instruction has a 4B displacement as part of the addressing mode, but this is an ARM question so I won't go tradeoffs between code-size and insn count with x86's variable-length insns.)

In broad (architecture-agnostic) terms, this is what your instructions do:
global_structure_pointer->field = value;
loads the value of global_structure_pointer into an addressing register.
adds the offset represented by field to the addressing register.
stores value into the memory location addressed by the addressing register.
global_structure[index].field = value;
loads the address of global_structure into an addressing register.
loads the value of index into an arithmetic register.
multiplies the arithmetic register by the size of global_structure.
adds the arithmetic register to the addressing register.
stores value into the memory location addressed by the addressing register.
Your confusion seems to be due to a misunderstanding of what "direct access" actually is.
THIS is direct access:
global_structure.field = value;
What you thought of as direct access is in fact indexed access.

What could be the possible reason that an ARM register gets polluted?

Recently I ran into this strange issue(Thumb2 instr-set). Say I have following code:
000a8948 <some_func>:
a8948: e92d 4ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
a894c: 460f mov r7, r1
a894e: f8df 8260 ldr.w r8, [pc, #608]
a8952: 4606 mov r6, r0 --> r0 shows b89a8408 in backtrace
a8954: 4b97 ldr r3, [pc, #604]
a8956: f100 0410 add.w r4, r0, #16 --> only place r4 gets assigned, but it can't be 0.
a895a: 44f8 add r8, pc
a895c: f04f 0900 mov.w r9, #0
a8960: ed2d 8b04 vpush {d8-d9}
a8964: f858 1003 ldr.w r1, [r8, r3]
a8968: b095 sub sp, #84 ; 0x54
a896a: f8df b24c ldr.w fp, [pc, #588]
a896e: ad0c add r5, sp, #48
a8970: f8df c248 ldr.w ip, [pc, #584]
a8974: f8df a248 ldr.w sl, [pc, #584]
a8978: ed9f 8b8b vldr d8, [pc, #556]
a897c: 680a ldr r2, [r1, #0]
a897e: 44fb add fp, pc
a8980: 44fc add ip, pc
a8982: 44fa add sl, pc
a8984: 9107 str r1, [sp, #28]
a8986: 9213 str r2, [sp, #76]
a8988: 6823 ldr r3, [r4, #0] // ERR: SIGSEGV happens, code 1 (SEGV_MAPPER), fault addr 0x0
So when this issue happens, r4 turned to value 0, what I cannot figure out is that how r4 gets polluted, given it only be assigned to r0+16 at line a8956? Could it be that Neon instructions write r4 somehow?

Cortex-M4 SIMD slower than Scalar

I have a few place in my code that could really use some speed up, when I try to use CM4 SIMD instructions the result is always slower than scalar version, for example, this is an alpha blending function I'm using a lot, it's not really slow but it serves as an example:
for (int y=0; y<h; y++) {
i=y*w;
for (int x=0; x<w; x++) {
uint spix = *srcp++;
uint dpix = dstp[i+x];
uint r=(alpha*R565(spix)+(256-alpha)*R565(dpix))>>8;
uint g=(alpha*G565(spix)+(256-alpha)*G565(dpix))>>8;
uint b=(alpha*B565(spix)+(256-alpha)*B565(dpix))>>8;
dstp[i+x]= RGB565(r, g, b);
}
}
R565, G565, B565 and RGB565 are macros that extract and pack RGB565 respectively, please ignore
Now I tried using __SMUAD and see if anything changes, the result was slower (or same speed as the original code) even tried loop unrolling, with no luck:
uint v0, vr, vg, vb;
v0 = (alpha<<16)|(256-alpha);
for (int y=0; y<h; y++) {
i=y*w;
for (int x=0; x<w; x++) {
spix = *srcp++;
dpix = dstp[i+x];
uint vr = R565(spix)<<16 | R565(dpix);
uint vg = G565(spix)<<16 | G565(dpix);
uint vb = B565(spix)<<16 | B565(dpix);
uint r = __SMUAD(v0, vr)>>8;
uint g = __SMUAD(v0, vg)>>8;
uint b = __SMUAD(v0, vb)>>8;
dstp[i+x]= RGB565(r, g, b);
}
}
I know this has been asked before, but given the architectural differences, and the fact that none of the answers really solve my problem, I'm asking again. Thanks!
Update
Scalar disassembly:
Disassembly of section .text.blend:
00000000 <blend>:
0: e92d 0ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
4: 6846 ldr r6, [r0, #4]
6: 68c4 ldr r4, [r0, #12]
8: b086 sub sp, #24
a: 199e adds r6, r3, r6
c: 9601 str r6, [sp, #4]
e: 9200 str r2, [sp, #0]
10: 68ca ldr r2, [r1, #12]
12: f89d 5038 ldrb.w r5, [sp, #56] ; 0x38
16: 9204 str r2, [sp, #16]
18: 9a01 ldr r2, [sp, #4]
1a: 426e negs r6, r5
1c: 4293 cmp r3, r2
1e: b2f6 uxtb r6, r6
20: da5b bge.n da <blend+0xda>
22: 8809 ldrh r1, [r1, #0]
24: 6802 ldr r2, [r0, #0]
26: 9102 str r1, [sp, #8]
28: fb03 fb01 mul.w fp, r3, r1
2c: 9900 ldr r1, [sp, #0]
2e: 4411 add r1, r2
30: 9103 str r1, [sp, #12]
32: 0052 lsls r2, r2, #1
34: 9205 str r2, [sp, #20]
36: 9903 ldr r1, [sp, #12]
38: 9a00 ldr r2, [sp, #0]
3a: 428a cmp r2, r1
3c: fa1f fb8b uxth.w fp, fp
40: da49 bge.n d6 <blend+0xd6>
42: 4610 mov r0, r2
44: 4458 add r0, fp
46: f100 4000 add.w r0, r0, #2147483648 ; 0x80000000
4a: 9a04 ldr r2, [sp, #16]
4c: f8dd a014 ldr.w sl, [sp, #20]
50: 3801 subs r0, #1
52: eb02 0040 add.w r0, r2, r0, lsl #1
56: 44a2 add sl, r4
58: f834 1b02 ldrh.w r1, [r4], #2
5c: 8842 ldrh r2, [r0, #2]
5e: f3c1 07c4 ubfx r7, r1, #3, #5
62: f3c2 09c4 ubfx r9, r2, #3, #5
66: f001 0c07 and.w ip, r1, #7
6a: f3c1 2804 ubfx r8, r1, #8, #5
6e: fb07 f705 mul.w r7, r7, r5
72: 0b49 lsrs r1, r1, #13
74: fb06 7709 mla r7, r6, r9, r7
78: ea41 01cc orr.w r1, r1, ip, lsl #3
7c: f3c2 2904 ubfx r9, r2, #8, #5
80: f002 0c07 and.w ip, r2, #7
84: fb08 f805 mul.w r8, r8, r5
88: 0b52 lsrs r2, r2, #13
8a: fb01 f105 mul.w r1, r1, r5
8e: 097f lsrs r7, r7, #5
90: fb06 8809 mla r8, r6, r9, r8
94: ea42 02cc orr.w r2, r2, ip, lsl #3
98: fb06 1202 mla r2, r6, r2, r1
9c: f007 07f8 and.w r7, r7, #248 ; 0xf8
a0: f408 58f8 and.w r8, r8, #7936 ; 0x1f00
a4: 0a12 lsrs r2, r2, #8
a6: ea48 0107 orr.w r1, r8, r7
aa: ea41 3142 orr.w r1, r1, r2, lsl #13
ae: f3c2 02c2 ubfx r2, r2, #3, #3
b2: 430a orrs r2, r1
b4: 4554 cmp r4, sl
b6: f820 2f02 strh.w r2, [r0, #2]!
ba: d1cd bne.n 58 <blend+0x58>
bc: 9902 ldr r1, [sp, #8]
be: 448b add fp, r1
c0: 9901 ldr r1, [sp, #4]
c2: 3301 adds r3, #1
c4: 428b cmp r3, r1
c6: fa1f fb8b uxth.w fp, fp
ca: d006 beq.n da <blend+0xda>
cc: 9a00 ldr r2, [sp, #0]
ce: 9903 ldr r1, [sp, #12]
d0: 428a cmp r2, r1
d2: 4654 mov r4, sl
d4: dbb5 blt.n 42 <blend+0x42>
d6: 46a2 mov sl, r4
d8: e7f0 b.n bc <blend+0xbc>
da: b006 add sp, #24
dc: e8bd 0ff0 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
e0: 4770 bx lr
e2: bf00 nop
SIMD disassembly:
sassembly of section .text.blend:
00000000 <blend>:
0: e92d 0ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
4: 6846 ldr r6, [r0, #4]
6: 68c4 ldr r4, [r0, #12]
8: b086 sub sp, #24
a: 199e adds r6, r3, r6
c: 9601 str r6, [sp, #4]
e: 9200 str r2, [sp, #0]
10: 68ca ldr r2, [r1, #12]
12: f89d 5038 ldrb.w r5, [sp, #56] ; 0x38
16: 9204 str r2, [sp, #16]
18: 9a01 ldr r2, [sp, #4]
1a: f5c5 7680 rsb r6, r5, #256 ; 0x100
1e: 4293 cmp r3, r2
20: ea46 4505 orr.w r5, r6, r5, lsl #16
24: da5d bge.n e2 <blend+0xe2>
26: 8809 ldrh r1, [r1, #0]
28: 6802 ldr r2, [r0, #0]
2a: 9102 str r1, [sp, #8]
2c: fb03 fb01 mul.w fp, r3, r1
30: 9900 ldr r1, [sp, #0]
32: 4411 add r1, r2
34: 9103 str r1, [sp, #12]
36: 0052 lsls r2, r2, #1
38: 9205 str r2, [sp, #20]
3a: 9903 ldr r1, [sp, #12]
3c: 9a00 ldr r2, [sp, #0]
3e: 428a cmp r2, r1
40: fa1f fb8b uxth.w fp, fp
44: da4b bge.n de <blend+0xde>
46: 4610 mov r0, r2
48: 4458 add r0, fp
4a: f100 4000 add.w r0, r0, #2147483648 ; 0x80000000
4e: 9a04 ldr r2, [sp, #16]
50: f8dd a014 ldr.w sl, [sp, #20]
54: 3801 subs r0, #1
56: eb02 0040 add.w r0, r2, r0, lsl #1
5a: 44a2 add sl, r4
5c: f834 2b02 ldrh.w r2, [r4], #2
60: 8841 ldrh r1, [r0, #2]
62: f3c2 07c4 ubfx r7, r2, #3, #5
66: f3c1 06c4 ubfx r6, r1, #3, #5
6a: ea46 4707 orr.w r7, r6, r7, lsl #16
6e: fb25 f707 smuad r7, r5, r7
72: f001 0907 and.w r9, r1, #7
76: ea4f 3c51 mov.w ip, r1, lsr #13
7a: f002 0607 and.w r6, r2, #7
7e: ea4f 3852 mov.w r8, r2, lsr #13
82: ea4c 0cc9 orr.w ip, ip, r9, lsl #3
86: ea48 06c6 orr.w r6, r8, r6, lsl #3
8a: ea4c 4606 orr.w r6, ip, r6, lsl #16
8e: fb25 f606 smuad r6, r5, r6
92: f3c1 2104 ubfx r1, r1, #8, #5
96: f3c2 2204 ubfx r2, r2, #8, #5
9a: ea41 4202 orr.w r2, r1, r2, lsl #16
9e: fb25 f202 smuad r2, r5, r2
a2: f3c6 260f ubfx r6, r6, #8, #16
a6: 097f lsrs r7, r7, #5
a8: f3c6 01c2 ubfx r1, r6, #3, #3
ac: f007 07f8 and.w r7, r7, #248 ; 0xf8
b0: 430f orrs r7, r1
b2: f402 52f8 and.w r2, r2, #7936 ; 0x1f00
b6: ea47 3646 orr.w r6, r7, r6, lsl #13
ba: 4316 orrs r6, r2
bc: 4554 cmp r4, sl
be: f820 6f02 strh.w r6, [r0, #2]!
c2: d1cb bne.n 5c <blend+0x5c>
c4: 9902 ldr r1, [sp, #8]
c6: 448b add fp, r1
c8: 9901 ldr r1, [sp, #4]
ca: 3301 adds r3, #1
cc: 428b cmp r3, r1
ce: fa1f fb8b uxth.w fp, fp
d2: d006 beq.n e2 <blend+0xe2>
d4: 9a00 ldr r2, [sp, #0]
d6: 9903 ldr r1, [sp, #12]
d8: 428a cmp r2, r1
da: 4654 mov r4, sl
dc: dbb3 blt.n 46 <blend+0x46>
de: 46a2 mov sl, r4
e0: e7f0 b.n c4 <blend+0xc4>
e2: b006 add sp, #24
e4: e8bd 0ff0 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
e8: 4770 bx lr
ea: bf00 nop

If you want to truly optimize a function, you should check the assembler output of the compiler. You'll then learn how it is transforming your code, and then you can learn how to write code to help the compiler produce better output, or write the necessary assembler.
One of the easy wins you'd hit on quickly on your alpha blending loop is that division is slow.
Instead of x / 100, use
x * 65536 / 65536 / 100
-> x * (65536 / 100) / 65536
-> x * 655.36 >> 16
-> x * 656 >> 16
An even better alternative would be to use alpha values between 0 -> 256 so that you can just bitshift the result without even needing to do this trick.
One reason why smuad might not giving any benefit is that you're having to move data into a format specifically for this command.
I'm not sure whether you'll be able to do better in general, but I thought I'd point out a way to avoid the division in your sample routine. Also, if you inspect the assembly, you may discover that there is code generation that you don't expect that can be eliminated.

Community Wiki Answer
The major change to accommodate SIMD type instruction is to transform the loading. The SMUAD instruction can be looked at as a 'C' instruction like,
/* a,b are register vectors/arrays of 16 bits */
SMUAD = a[0] * b[0] + a[1] * b[1];
It is very easy to transform these. Instead of,
u16 *srcp, dstp;
uint spix = *srcp++;
uint dpix = dstp[i+x];
Use the full bus and get 32bits at a time,
uint *srcp, *dstp; /* These are two 16 bit values. */
uint spix = *srcp++;
uint dpix = dstp[i+x];
/* scale `dpix` and `spix` by alpha */
spix /= (alpha << 16 | alpha); /* Precompute, reduce strength, etc. */
dpix /= (1-alpha << 16 | 1-alpha); /* Precompute, reduce strength, etc. */
/* Hint, you can use SMUAD here? or maybe not.
You could if you scale at the same time*/
It looks like SMUL is a good fit for the alpha scaling; you don't want to add the two halves.
Now, spix and dpix contain two pixels. The vr synthetic is not needed. You may do two operations at one time.
uint rb = (dpix + spix) & ~GMASK; /* GMASK is x6xx6x bits. */
uint g = (dpix + spix) & GMASK;
/* Maybe you don't care about overflow?
A dual 16bit add helps, if the M4 has it? */
dstp[i+x]= rb | g; /* write 32bits or two pixels at a time. */
Mainly just making better use of the BUS by loading 32bits at a time will definitely speed up your routine. Standard 32 bit integer math may work most of the time if you are careful of the ranges and don't overflow the lower 16bit value in to the upper one.
For blitter code, Bit blog and Bit hacks are useful for extraction and manipulation of RGB565 values; whether SIMD or straight Thumb2 code.
Mainly, it is never a simple re-compile to use SIMD. It can be weeks of work to transform an algorithm. If done properly, SIMD speed ups are significant when the algorithm is not memory bandwidth bound and don't involve many conditionals.

Now with the disassembly posted: You'll see that both the scalar and the simd version have 29 instructions, and the SIMD version actually takes more code space. (Scalar is 0x58 -> 0xba for the inner loop, vs SIMD is (0x5c -> 0xc2))
You can see a lot of instructions are used getting the data into the right format for both loops.. maybe you can improve the performance more by working on the RGB bit unpacking/repacking rather than the alpha blend calculation!
Edit: You may also want to consider processing pairs of pixels at a time.