Neon 64bit aarch64: confusion about ld4r

Neon 64bit aarch64: confusion about ld4r - arm

I'm confused by the new ld4r instruction in aarch64.
The following code (loading SAME 4 32-bit values into v[20-23]):
ld1 { v20.4s }, [out1]
ld1 { v21.4s }, [out1]
ld1 { v22.4s }, [out1]
ld1 { v23.4s }, [out1]
seems to be equivalent to the following code:
ld1 { v20.4s }, [out1]
mov v21.16b, v20.16b
mov v22.16b, v20.16b
mov v23.16b, v20.16b
but it doesn't seem to be equivalent to the following line:
ld4r { v20.4s, v21.4s, v22.4s, v23.4s }, [out1]
Am I misreading the ld4r instruction? Isn't it supposed to replicate over 4 lanes?

It seems that ld4r only loads a single 4-element structure and it replicates it across the SAME lane. This is not a lane-to-lane replication.

Related

Why does gcc, with -O3, unnecessarily clear a local ARM NEON array?

Consider the following code (Compiler Explorer link), compiled under gcc and clang with -O3 optimization:
#include <arm_neon.h>
void bug(int8_t *out, const int8_t *in) {
for (int i = 0; i < 2; i++) {
int8x16x4_t x;
x.val[0] = vld1q_s8(&in[16 * i]);
x.val[1] = x.val[2] = x.val[3] = vshrq_n_s8(x.val[0], 7);
vst4q_s8(&out[64 * i], x);
}
}
NOTE: this is a minimally reproducible version of an issue that is popping up in many different functions of my actual, much more complex code, filled with arithmetic/logical/permutation instructions performing a totally different operation from above. Please refrain from criticizing and/or suggesting different ways of doing what the code above does, unless it has an effect on the code generation issue discussed below.
clang generates sane code:
bug(signed char*, signed char const*): // #bug(signed char*, signed char const*)
ldr q0, [x1]
sshr v1.16b, v0.16b, #7
mov v2.16b, v1.16b
mov v3.16b, v1.16b
st4 { v0.16b, v1.16b, v2.16b, v3.16b }, [x0], #64
ldr q0, [x1, #16]
sshr v1.16b, v0.16b, #7
mov v2.16b, v1.16b
mov v3.16b, v1.16b
st4 { v0.16b, v1.16b, v2.16b, v3.16b }, [x0]
ret
As for gcc, it inserts a lot of unnecessary operations, apparently zeroing out the registers that will be eventually input to the st4 instruction:
bug(signed char*, signed char const*):
sub sp, sp, #128
# mov x9, 0
# mov x8, 0
# mov x7, 0
# mov x6, 0
# mov x5, 0
# mov x4, 0
# mov x3, 0
# stp x9, x8, [sp]
# mov x2, 0
# stp x7, x6, [sp, 16]
# stp x5, x4, [sp, 32]
# str x3, [sp, 48]
ldr q0, [x1]
# stp x2, x9, [sp, 56]
# stp x8, x7, [sp, 72]
sshr v4.16b, v0.16b, 7
# str q0, [sp]
# ld1 {v0.16b - v3.16b}, [sp]
# stp x6, x5, [sp, 88]
mov v1.16b, v4.16b
# stp x4, x3, [sp, 104]
mov v2.16b, v4.16b
# str x2, [sp, 120]
mov v3.16b, v4.16b
st4 {v0.16b - v3.16b}, [x0], 64
### ldr q4, [x1, 16]
### add x1, sp, 64
### str q4, [sp, 64]
sshr v4.16b, v4.16b, 7
### ld1 {v0.16b - v3.16b}, [x1]
mov v1.16b, v4.16b
mov v2.16b, v4.16b
mov v3.16b, v4.16b
st4 {v0.16b - v3.16b}, [x0]
add sp, sp, 128
ret
I manually prefixed with # all instructions that could be safely taken out, without affecting the result of the function.
In addition, the instructions prefixed with ### perform an unnecessary trip to memory and back (and anyway, the mov instructions following ### ld1 ... overwrite 3 out of 4 registers loaded by that ld1 instruction), and could be replaced by a single load straight to v0.16b -- and the sshr instruction in the middle of the block would then use v0.16b as its source register.
As far as I know, x, being a local variable, can be used unitialized; and even if it weren't, all registers are properly initialized, so there's no point in zeroing them out just to immediately overwrite them with values.
I'm inclined to think this is a gcc bug, but before reporting it, I'm curious if I missed something. Maybe there's a compilation flag, an __attribute__ or something else that I could to make gcc generate sane code.
Thus, my question: is there anything I can do to generate sane code, or is this a bug I need to report to gcc?

Code generation on a fairly current development version of gcc appears to have improved immensely, at least for this case.
After installing the gcc-snapshot package (dated 20210918), gcc generates the following code:
bug:
ldr q5, [x1]
sshr v4.16b, v5.16b, 7
mov v0.16b, v5.16b
mov v1.16b, v4.16b
mov v2.16b, v4.16b
mov v3.16b, v4.16b
st4 {v0.16b - v3.16b}, [x0], 64
ldr q4, [x1, 16]
mov v0.16b, v4.16b
sshr v4.16b, v4.16b, 7
mov v1.16b, v4.16b
mov v2.16b, v4.16b
mov v3.16b, v4.16b
st4 {v0.16b - v3.16b}, [x0]
ret
Not ideal yet -- at least two mov instruction could be removed per iteration by changing the destination registers of ldr and sshr, but considerably better than before.

Short answer: welcome to GCC. Do not bother optimizing anything while you are using it. And Clang isn't better either.
Secret tip: Add ARM and ARM64 components to Visual Studio, and you'd be surprised how well it works. The problem is however, it generates COFF binary, not ELF, and I haven't been able to find a converter.
You can use Ida Pro or dumpbin and generate a disassembly file and it look. like:
; void __fastcall bug(char *out, const char *in)
EXPORT bug
bug
MOV W10, #0
MOV W9, #0
$LL4 ; CODE XREF: bug+30↓j
ADD X8, X1, W9,SXTW
ADD W9, W9, #0x10
CMP W9, #0x20 ; ' '
LD1 {V0.16B}, [X8]
ADD X8, X0, W10,SXTW
ADD W10, W10, #0x40 ; '#'
SSHR V1.16B, V0.16B, #7
MOV V2.16B, V1.16B
MOV V3.16B, V1.16B
ST4 {V0.16B-V3.16B}, [X8]
B.LT $LL4
RET
; End of function bug
You can copy paste the disassembly to a GCC assembly file.
And don't bother with reporting the "bug" either. If they were listening, GCC wouldn't be this bad in first place.

Keil ARMCC int64 comparison for Cortex M3

I noticed that armcc generates this kind of code to compare two int64 values:
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
Which can be roughly translated as:
r0 = lower_word_1 ^ lower_word_2
r1 = higher_word_1 ^ higher_word_2
r0 = r1 | r0
jump if r0 is not zero
and something like this, when comparing int64 (int r0,r1) with integral constant (i.e. int, in r3)
0x08000674 4058 EORS r0,r0,r3
0x08000676 4308 ORRS r0,r0,r1
0x08000678 D116 BNE 0x080006A8
with the same idea, just skipping comparing higher words altogether since it just needs to be zero.
but I'm interested - why is it so complicated?
Both cases can be done very straight-forward by comparing lower and higher words and making BNE after both:
for two int64, assuming the same registers
CMP lower words
BNE
CMP higher words
BNE
and for int64 with integral constant:
CMP lower words
BNE
CBNZ if higher word is non-zero
This will take the same number of instructions, each may (or may not, depending on the registers used) be 2 bytes in length.
arm-none-eabi-gcc does something different but no playing around with EORS either
So why armcc does this? I can't see any real benefit; both version require the same number of commands (each of which my be wide or short, so no real profit there).
The only slight benefit I can see is that less branching which my be somewhat beneficial for a flash prefetch buffer. But since there is no cache or branch prediction, I'm not really buying it.
So my reasoning is that this pattern is simply legacy, from ARM7 Architecture where no CBZ/CBNZ existed and mixing ARM and Thumb instructions was not very easy.
Am I missing something?
P.S. Armcc does this on every optimization level so I presume it is some kind of 'hard-coded' piece
UPD: Sure, there is an execution pipeline that will be flushed with every branch taken, however every solution requires at least one conditional branch that will or will not be taken (depending on integers that are compared), so pipeline will be flushed anyway with equal probability.
So I can't really see a point in minimizing conditional branches.
Moreover, if lower and higher words would be compared explicitly and integers are not equal, branch will be taken sooner.
Avoiding branch instruction completely is possible with IT-block but on Cortex-M3 it can be only up to 4 instructions long so I'm gonna ignore this for generality.

The efficiency of the generated code is not counted in the number of the machine code instructions. You need to know the internals of the target machine as well (not only the clock/instruction) but also how the fetch/decode/execute process works.
Every branch instruction in the Cortex M3 devices flushes the pipeline. Pipeline has to be fed again. If you run from FLASH memory (it is slow) wait states will also significantly slow this process. The compiler tries to avoid branches as much as it is possible.
It can be done your way using other instructions:
int foo(int64_t x, int64_t y)
{
return x == y;
}
cmp r1, r3
itte eq
cmpeq r0, r2
moveq r0, #1
movne r0, #0
bx lr
Trust your compiler. People who write them know their trade :). Before you learn more about the ARM Cortex you cant judge the compiler this simple way as you do now.
The code from your example is very well optimized and simple. Keil does a very good job.

As pointed out the difference is branching vs not branching. If you can avoid branching you want to avoid branching.
While the ARM documentation may be interesting, as with an x86 and a full sized ARM and many other places the system plays as of a role here. High performance cores like ones from ARM are sensitive to the system implementation. These cortex-m cores are used in microcontrollers which are quite cost sensitive, so while they blow away a PIC or AVR or msp430 for mips to mhz and mips per dollar they are still cost sensitive. With newer technology or perhaps higher cost, you are starting to see flashes that are at the speed of the processor for the full range (do not have to add wait states at various places across the range of valid clock speeds), but for a long time you saw the flash at half the speed of the core at the slowest core speeds. And then getting worse as you choose higher core speeds. But sram often matching the core. Either way flash is a major portion of the cost of the part and how much and how fast it is to some extent drives part price.
Depending on the core (anything from ARM) the fetch size and as a result alignment varies and as a result benchmarks can be skewed/manipulated based on alignment of a loop style test and how many fetches are needed (trivial to demonstrate with many cortex-ms). The cortex-ms are generally either a halfword or full word fetch and some are compile time options for the chip vendor (so you might have two chips with the same core but the performance varies). And this can be demonstrated too...just not here...unless pushed, I have done this demo too many times at this site now. But we can manage that here in this test.
I do not have a cortex-m3 handy I would have to dig one out and wire it up if need be, should not need to though have a cortex-m4 handy which is also an armv7-m. A NUCLEO-F411RE
Test fixture
.thumb_func
.globl HOP
HOP:
bx r2
.balign 0x20
.thumb_func
.globl TEST0
TEST0:
push {r4,r5}
mov r4,#0
mov r5,#0
ldr r2,[r0]
t0:
cmp r4,r5
beq skip
skip:
subs r1,r1,#1
bne t0
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
The systick timer generally works just fine for these kinds of tests, no need to mess with the debuggers timer it often just shows the same thing with more work. More than enough here.
Called like this with the result printed out in hex
hexstring(TEST0(STK_CVR,0x10000));
hexstring(TEST0(STK_CVR,0x10000));
copy the flash code to ram and execute there
hexstring(HOP(STK_CVR,0x10000,0x20000001));
hexstring(HOP(STK_CVR,0x10000,0x20000001));
Now the stm32's have this cache thing in front of the flash which affects loop based benchmarks like these as well as other benchmarks against these parts, sometimes you cannot get past that and you end up with a bogus benchmark. But not in this case.
To demonstrate fetch effects you want a system delay in fetching, if the fetches are too fast you might not see the fetch effects.
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d1ff bne.n 8000030 <skip>
08000030 <skip>:
00050001 <-- flash time
00050001 <-- flash time
00060004 <-- sram time
00060004 <-- sram time
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: d0ff beq.n 8000030 <skip>
08000030 <skip>:
00060001
00060001
00080000
00080000
0800002c <t0>:
800002c: 42ac cmp r4, r5
800002e: bf00 nop
08000030 <skip>:
00050001
00050001
00060000
00060000
So we can see that if the branch is not taken it is the same as a nop. As far as this loop based test goes. So perhaps there is a branch predictor (often a small cache that remembers the last N number of branches and their destinations and can start prefetch a clock or two early). I did not dig into it yet, did not really need to as we can already see that there is a performance cost due to a branch that has to be taken (making your suggested code not equal despite the same number of instructions, this is the same number of instructions but not equal performance).
So the quickest way to remove the loop and avoid the stm32 cache thing is to do something like this in ram
push {r4,r5}
mov r4,#0
mov r5,#0
cmp r4,r5
ldr r2,[r0]
instruction under test repeated many times
ldr r3,[r0]
subs r0,r2,r3
pop {r4,r5}
bx lr
with the instruction under test being a bne to the next, a beq to the next or a nop
// 800002e: d1ff bne.n 8000030 <skip>
00002001
// 800002e: d0ff beq.n 8000030 <skip>
00004000
// 800002e: bf00 nop
00001001
I did not have room for 0x10000 instructions so I used 0x1000, and we can see that there is a hit for both branch types with the one that does branch being more costly.
Note that the loop based benchmark did not show this difference, have to be careful doing benchmarks or judging results. Even the ones I have shown here.
I could spend more time tweaking core settings or system settings, but based on experience I think this has already demonstrated the desire not to have a cmp, bne, cbnz replace eor, orr, bne. Now to be fair, your other one where it is a eor.w (thumb2 extensions) that burns more clocks than thumb2 instructions so there is another thing to consider (I measured it as well).
Remember for these high performance cores you need to be very sensitive to fetching and fetch alignment, very easy to make a bad benchmark. Not that an x86 is not high performance, but to make the inefficient core run smoother there is a ton of stuff around it to try to keep the core fed, similar to running a semi-truck vs a sports car, the truck can be efficient once up to speed on the highway but city driving, not so much even keeping to the speed limit a Yugo will get across town faster than the semi truck (if it does not break down). Fetch effects, unaligned transfers, etc are difficult to see in an x86, but an ARM somewhat easy, so to get the best performance you want to avoid the easy cycle eaters.
Edit
Note that I jumped to conclusions too early about what GCC produces. Had to work more on trying to craft an equivalent comparison. I started with
unsigned long long fun2 ( unsigned long long a)
{
if(a==0) return(1);
return(0);
}
unsigned long long fun3 ( unsigned long long a)
{
if(a!=0) return(1);
return(0);
}
00000028 <fun2>:
28: 460b mov r3, r1
2a: 2100 movs r1, #0
2c: 4303 orrs r3, r0
2e: bf0c ite eq
30: 2001 moveq r0, #1
32: 4608 movne r0, r1
34: 4770 bx lr
36: bf00 nop
00000038 <fun3>:
38: 460b mov r3, r1
3a: 2100 movs r1, #0
3c: 4303 orrs r3, r0
3e: bf14 ite ne
40: 2001 movne r0, #1
42: 4608 moveq r0, r1
44: 4770 bx lr
46: bf00 nop
Which used an it instruction which is a natural solution here since the if-then-else cases can be a single instruction. Interesting that they chose to use r1 instead of the immediate #0 I wonder if that is a generic optimization, due to complexity with immediates on a fixed length instruction set or perhaps immediates take less space on some architectures. Who knows.
800002e: bf0c ite eq
8000030: bf00 nopeq
8000032: bf00 nopne
00003002
00003002
800002e: bf14 ite ne
8000030: bf00 nopne
8000032: bf00 nopeq
00003002
00003002
Using sram 0x1000 sets of three instructions linearly, so 0x3002 means 1 clock per instruction on average.
Putting a mov in the it block doesn't change performance
ite eq
moveq r0, #1
movne r0, r1
It is still one clock per.
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a)
{
for(;a!=0;a--)
{
more_fun(5);
}
return(0);
}
48: b538 push {r3, r4, r5, lr}
4a: ea50 0301 orrs.w r3, r0, r1
4e: d00a beq.n 66 <fun4+0x1e>
50: 4604 mov r4, r0
52: 460d mov r5, r1
54: 2005 movs r0, #5
56: f7ff fffe bl 0 <more_fun>
5a: 3c01 subs r4, #1
5c: f165 0500 sbc.w r5, r5, #0
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
66: 2000 movs r0, #0
68: 2100 movs r1, #0
6a: bd38 pop {r3, r4, r5, pc}
This is basically the compare with zero
60: ea54 0305 orrs.w r3, r4, r5
64: d1f6 bne.n 54 <fun4+0xc>
Against another
void more_fun ( unsigned int );
unsigned long long fun4 ( unsigned long long a, unsigned long long b)
{
for(;a!=b;a--)
{
more_fun(5);
}
return(0);
}
00000048 <fun4>:
48: 4299 cmp r1, r3
4a: bf08 it eq
4c: 4290 cmpeq r0, r2
4e: d011 beq.n 74 <fun4+0x2c>
50: b5f8 push {r3, r4, r5, r6, r7, lr}
52: 4604 mov r4, r0
54: 460d mov r5, r1
56: 4617 mov r7, r2
58: 461e mov r6, r3
5a: 2005 movs r0, #5
5c: f7ff fffe bl 0 <more_fun>
60: 3c01 subs r4, #1
62: f165 0500 sbc.w r5, r5, #0
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
6e: 2000 movs r0, #0
70: 2100 movs r1, #0
72: bdf8 pop {r3, r4, r5, r6, r7, pc}
74: 2000 movs r0, #0
76: 2100 movs r1, #0
78: 4770 bx lr
7a: bf00 nop
And they choose to use an it block here.
66: 42ae cmp r6, r5
68: bf08 it eq
6a: 42a7 cmpeq r7, r4
6c: d1f5 bne.n 5a <fun4+0x12>
It is on par with this for number of instructions.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
But those thumb2 instructions are going to execute longer. So overall I think GCC appears to have made a better sequence, but of course you want to check apples to apples start with the same C code and see what each produced. The gcc one reads easier than the eor/orr stuff, can think less about what it is doing.
8000040: 406c eors r4, r5
00001002
8000042: ea94 0305 eors.w r3, r4, r5
00002001
0x1000 instructions one is two halfwords (thumb2) one is one halfword (thumb). Takes two clocks not really surprised.
0x080001B0 EA840006 EOR r0,r4,r6
0x080001B4 EA850107 EOR r1,r5,r7
0x080001B8 4308 ORRS r0,r0,r1
0x080001BA D101 BNE 0x080001C0
I see six clocks there before adding any other penalties, not four (on this cortex-m4).
Note I made the eors.w aligned and unaligned and it did not change the performance. Still two clocks.

Why ARM NEON code slower than native C code

I'am implementing dequantize operation in ARM NEON (ARM8-A architecture). But i faced a strange point that ARM NEON version (11 ms) slower than C version (4.75 ms).
Here is our code,
NEON ASM code:
.arch armv8-a+crc
.file "dequantize.c"
.text
.align 2
.global retinaface_dequantize_v3
.type retinaface_dequantize_v3, %function
retinaface_dequantize_v3:
.LFB3836:
.cfi_startproc
mov x6, x0 // ivtmp.39, in_cls
add x0, x0, 266240 // _214, in_cls,
add x0, x0, 2560 // _214, _214,
adrp x7, .LC0 // tmp294,
ldr q21, [x7, #:lo12:.LC0] // tmp222,
adrp x7, .LC1 // tmp295,
ldr q20, [x7, #:lo12:.LC1] // tmp227,
adrp x7, .LC2 // tmp296,
ldr q19, [x7, #:lo12:.LC2] // tmp229,
adrp x7, .LC3 // tmp297,
ldr q1, [x7, #:lo12:.LC3] // tmp246,
adrp x7, .LC4 // tmp298,
ldr q0, [x7, #:lo12:.LC4] // tmp248,
.L2:
ldr q28, [x6] // _108,* ivtmp.39
ldr q27, [x6, 16] // _107,
ldr q26, [x1] // _106,* ivtmp.41
ldr q25, [x1, 16] // _105,
ldr q24, [x1, 32] // _104,
ldr q23, [x1, 48] // _103,
ldr q22, [x2] // _102,* ivtmp.45
ldr q18, [x2, 16] // _101,
ldr q17, [x2, 32] // _100,
ldr q16, [x2, 48] // _99,
ldr q7, [x2, 64] // _98,
ldr q6, [x2, 80] // _97,
ldr q5, [x2, 96] // _96,
ldr q4, [x2, 112] // _95,
ldr q3, [x2, 128] // _94,
ldr q2, [x2, 144] // _93,
fmulx v28.2d, v28.2d, v21.2d // tmp221, _108, tmp222
fmulx v27.2d, v27.2d, v21.2d // tmp224, _107, tmp222
fmulx v26.2d, v26.2d, v19.2d // tmp228, tmp226, tmp229
fmulx v25.2d, v25.2d, v19.2d // tmp233, tmp231, tmp229
fmulx v24.2d, v24.2d, v19.2d // tmp238, tmp236, tmp229
fmulx v23.2d, v23.2d, v19.2d // tmp243, tmp241, tmp229
fmulx v22.2d, v22.2d, v0.2d // tmp247, tmp245, tmp248
fmulx v18.2d, v18.2d, v0.2d // tmp252, tmp250, tmp248
fmulx v17.2d, v17.2d, v0.2d // tmp257, tmp255, tmp248
fmulx v16.2d, v16.2d, v0.2d // tmp262, tmp260, tmp248
fmulx v7.2d, v7.2d, v0.2d // tmp267, tmp265, tmp248
fmulx v6.2d, v6.2d, v0.2d // tmp272, tmp270, tmp248
fmulx v5.2d, v5.2d, v0.2d // tmp277, tmp275, tmp248
fmulx v4.2d, v4.2d, v0.2d // tmp282, tmp280, tmp248
fmulx v3.2d, v3.2d, v0.2d // tmp287, tmp285, tmp248
fmulx v2.2d, v2.2d, v0.2d // tmp292, tmp290, tmp248
fadd v26.2d, v26.2d, v20.2d // tmp226, _106, tmp227
fadd v25.2d, v25.2d, v20.2d // tmp231, _105, tmp227
fadd v24.2d, v24.2d, v20.2d // tmp236, _104, tmp227
fadd v23.2d, v23.2d, v20.2d // tmp241, _103, tmp227
fadd v22.2d, v22.2d, v1.2d // tmp245, _102, tmp246
fadd v18.2d, v18.2d, v1.2d // tmp250, _101, tmp246
fadd v17.2d, v17.2d, v1.2d // tmp255, _100, tmp246
fadd v16.2d, v16.2d, v1.2d // tmp260, _99, tmp246
fadd v7.2d, v7.2d, v1.2d // tmp265, _98, tmp246
fadd v6.2d, v6.2d, v1.2d // tmp270, _97, tmp246
fadd v5.2d, v5.2d, v1.2d // tmp275, _96, tmp246
fadd v4.2d, v4.2d, v1.2d // tmp280, _95, tmp246
fadd v3.2d, v3.2d, v1.2d // tmp285, _94, tmp246
fadd v2.2d, v2.2d, v1.2d // tmp290, _93, tmp246
str q28, [x3] // tmp221,* ivtmp.55
str q27, [x3, 16] // tmp224,
str q26, [x4] // tmp228,* ivtmp.57
str q25, [x4, 16] // tmp233,
str q24, [x4, 32] // tmp238,
str q23, [x4, 48] // tmp243,
str q22, [x5] // tmp247,* ivtmp.61
str q18, [x5, 16] // tmp252,
str q17, [x5, 32] // tmp257,
str q16, [x5, 48] // tmp262,
str q7, [x5, 64] // tmp267,
str q6, [x5, 80] // tmp272,
str q5, [x5, 96] // tmp277,
str q4, [x5, 112] // tmp282,
str q3, [x5, 128] // tmp287,
str q2, [x5, 144] // tmp292,
add x6, x6, 32 // ivtmp.39, ivtmp.39,
add x1, x1, 64 // ivtmp.41, ivtmp.41,
add x2, x2, 160 // ivtmp.45, ivtmp.45,
add x3, x3, 32 // ivtmp.55, ivtmp.55,
add x4, x4, 64 // ivtmp.57, ivtmp.57,
add x5, x5, 160 // ivtmp.61, ivtmp.61,
cmp x6, x0 // ivtmp.39, _214
bne .L2 //,
// dequantize.c:475: }
ret
.cfi_endproc
.LFE3836:
.size retinaface_dequantize_v3, .-retinaface_dequantize_v3
.section .rodata.cst16,"aM",#progbits,16
.align 4
.LC0:
.word 0
.word 1064304640
.word 0
.word 1064304640
.LC1:
.word 0
.word -1067417600
.word 0
.word -1067417600
.LC2:
.word 536870912
.word 1068027667
.word 536870912
.word 1068027667
.LC3:
.word 0
.word -1067515904
.word 0
.word -1067515904
.LC4:
.word 3758096384
.word 1069039660
.word 3758096384
.word 1069039660
.ident "GCC: (Debian 8.3.0-6) 8.3.0"
.section .note.GNU-stack,"",#progbits
C code:
void retinaface_dequantize_v0(uint8_t *in_cls, uint8_t *in_bbox, uint8_t *in_ldm, double *out_cls, double *out_bbox, double *out_ldm, uint64_t length)
{
double const dequan_cls = 0.00390625;
double const dequan_bbox = 0.048454854637384415;
double const dequan_ldm = 0.0947292372584343;
const uint8_t bbox_minus = 132;
const uint8_t ldm_minus = 124;
for (int64_t i = 16799;i>=0;i--)
{
//cls
out_cls[i*2] = dequan_cls * (uint8_t)in_cls[i*2];
out_cls[i*2+1] = dequan_cls * (uint8_t)in_cls[i*2+1];
//bbox
out_bbox[i*4] = dequan_bbox * ((uint8_t)in_bbox[i*4] - bbox_minus);
out_bbox[i*4+1] = dequan_bbox * ((uint8_t)in_bbox[i*4+1] - bbox_minus);
out_bbox[i*4+2] = dequan_bbox * ((uint8_t)in_bbox[i*4+2] - bbox_minus);
out_bbox[i*4+3] = dequan_bbox * ((uint8_t)in_bbox[i*4+3] - bbox_minus);
//ldm
out_ldm[i*10] = dequan_ldm * ((uint8_t)in_ldm[i*10] - ldm_minus);
out_ldm[i*10+1] = dequan_ldm * ((uint8_t)in_ldm[i*10+1] - ldm_minus);
out_ldm[i*10+2] = dequan_ldm * ((uint8_t)in_ldm[i*10+2] - ldm_minus);
out_ldm[i*10+3] = dequan_ldm * ((uint8_t)in_ldm[i*10+3] - ldm_minus);
out_ldm[i*10+4] = dequan_ldm * ((uint8_t)in_ldm[i*10+4] - ldm_minus);
out_ldm[i*10+5] = dequan_ldm * ((uint8_t)in_ldm[i*10+5] - ldm_minus);
out_ldm[i*10+6] = dequan_ldm * ((uint8_t)in_ldm[i*10+6] - ldm_minus);
out_ldm[i*10+7] = dequan_ldm * ((uint8_t)in_ldm[i*10+7] - ldm_minus);
out_ldm[i*10+8] = dequan_ldm * ((uint8_t)in_ldm[i*10+8] - ldm_minus);
out_ldm[i*10+9] = dequan_ldm * ((uint8_t)in_ldm[i*10+9] - ldm_minus)
}
}

The code looks like code that a good compiler will vectorize completely. Disassemble the code to see what the compiler has generated.
If it has not, use NEON compiler intrinsics to generate fully vectorized code, instead of assembler. The advantage of using compiler intrinsics is that the compiler can reorder ARM, and ARM NEON instructions in order to prevent pipeline stalls, and maximize concurrent execution. And it does a very good job.
The most obvious problem with your code: you need to interleave arithmetic operations and ARM assembler with vld1/vst1 load/store instructions in order to prevent the pipeline from stalling while waiting for consecutive loads/stores to complete. The ARM instructions at the head of the loop need to be pushed down between initial load instructions as as much as possible, where they will execute in parallel, and the NEON math instructions need to be scattered between the initial load instructions so that they execute concurrently with stalled memory reads (and writes).
Compilers LOVE doing that sort of instruction scheduling, and do a better job than even the most expert of assembler coders can do by hand. An expert would start with the compiler-generated code, and use the (very expensive and difficult to use) ARM profiling tools to look for places where the compiler has mis-scheduled operations. And would be happy to get a 5 or 10% improvement over well-written C/C++ code.
An added bonus when using intrinsics, is that you can tell the compiler which specific ARM processor you're using, and the compiler will optimize scheduling for that particular processor. So you won't end up rewriting the code from scratch when you migrate from an a52 to an a76, or to an Arm9 processor.
fwiw, all the vld1 data reads will be cache misses. Reasonably optimized code should be spending 99% of its time waiting for the memory reads, with pretty much all of the rest of the code executing concurrently while waiting for the reads to complete.

Different gcc assembly when using designated initializers

I was checking some gcc generated assembly for ARM and noticed that I get strange results if I use designated initializers:
E.g. if I have this code:
struct test
{
int x;
int y;
};
__attribute__((noinline))
struct test get_struct_1(void)
{
struct test x;
x.x = 123456780;
x.y = 123456781;
return x;
}
__attribute__((noinline))
struct test get_struct_2(void)
{
return (struct test){ .x = 123456780, .y = 123456781 };
}
I get the following output with gcc -O2 -std=C11 for ARM (ARM GCC 6.3.0):
get_struct_1:
ldr r1, .L2
ldr r2, .L2+4
stm r0, {r1, r2}
bx lr
.L2:
.word 123456780
.word 123456781
get_struct_2: // <--- what is happening here
mov r3, r0
ldr r2, .L5
ldm r2, {r0, r1}
stm r3, {r0, r1}
mov r0, r3
bx lr
.L5:
.word .LANCHOR0
I can see the constants for the first function, but I don't understand how get_struct_2 works.
If I compile for x86, both functions just load the same single 64-bit value in a single instruction.
get_struct_1:
movabs rax, 530242836987890956
ret
get_struct_2:
movabs rax, 530242836987890956
ret
Am I provoking some undefined behavior, or is this .LANCHOR0 somehow related to these constants?

Looks like gcc shoots itself in the foot with an extra level of indirection after merging the loads of the constants into an ldm.
No idea why, but pretty obviously a missed optimization bug.
x86-64 is easy to optimize for; the entire 8-byte constant can go in one immediate. But ARM often uses PC-relative loads for constants that are too big for one immediate.

qemu-arm branches to a seemingly abstract instruction

I am trying to build a binary translator for arm bear metal compiled code and I try to verify proper execution flow by comparing it to that of qemu-arm. I use the following command to dump the program flow:
qemu-arm -d in_asm,cpu -singlestep -D a.flow a.out
I noticed something strange, where the program seems to jump to an irrelevant instruction, since 0x000080b4 is not the branch nor the next instruction following 0x000093ec.
0x000093ec: 1afffff9 bne 0x93d8
R00=00000000 R01=00009c44 R02=00000002 R03=00000000
R04=00000001 R05=0001d028 R06=00000002 R07=00000000
R08=00000000 R09=00000000 R10=0001d024 R11=00000000
R12=f6ffed88 R13=f6ffed88 R14=000093e8 R15=000093ec
PSR=20000010 --C- A usr32
R00=00000000 R01=00009c44 R02=00000002 R03=00000000
R04=00000001 R05=0001d028 R06=00000002 R07=00000000
R08=00000000 R09=00000000 R10=0001d024 R11=00000000
R12=f6ffed88 R13=f6ffed88 R14=000093e8 R15=000093d8
PSR=20000010 --C- A usr32
----------------
IN:
0x000080b4: e59f3060 ldr r3, [pc, #96] ; 0x811c
The instruction that actually executes corresponds to the beggining of the <frame_dummy> tag in the disassembly. Can someone explain what actually happens within the emulator and is this behavior normal in the ARM architecture? The program was compiled with: arm-none-eabi-gcc --specs=rdimon.specs a.c
Here is the same segment of the program flow without the CPU state:
0x0000804c: e59f3018 ldr r3, [pc, #24] ; 0x806c
0x00008050: e3530000 cmp r3, #0 ; 0x0
0x00008054: 01a0f00e moveq pc, lr
----------------
IN: __libc_init_array
0x000093e8: e1560004 cmp r6, r4
0x000093ec: 1afffff9 bne 0x93d8
----------------
IN:
0x000080b4: e59f3060 ldr r3, [pc, #96] ; 0x811c
0x000080b8: e3530000 cmp r3, #0 ; 0x0
0x000080bc: 0a000009 beq 0x80e8
This is the disassembly of this part:
93d4: 0a000005 beq 93f0 <__libc_init_array+0x68>
93d8: e2844001 add r4, r4, #1
93dc: e4953004 ldr r3, [r5], #4
93e0: e1a0e00f mov lr, pc
93e4: e1a0f003 mov pc, r3
93e8: e1560004 cmp r6, r4
93ec: 1afffff9 bne 93d8 <__libc_init_array+0x50>
93f0: e8bd4070 pop {r4, r5, r6, lr}

It is a reverse jump to previously emitted TB, you don't even have to read that much back:
IN: __libc_init_array
0x000093d8: e2844001 add r4, r4, #1 ; 0x1
0x000093dc: e4953004 ldr r3, [r5], #4
0x000093e0: e1a0e00f mov lr, pc
0x000093e4: e1a0f003 mov pc, r3
----------------
IN: register_fini
0x0000804c: e59f3018 ldr r3, [pc, #24] ; 0x806c
0x00008050: e3530000 cmp r3, #0 ; 0x0
0x00008054: 01a0f00e moveq pc, lr
----------------
IN: __libc_init_array
0x000093e8: e1560004 cmp r6, r4
0x000093ec: 1afffff9 bne 0x93d8
So, qemu is not showing it again. Notice this loop is iterating function pointers, the first one points to register_fini and the second one to the magical 0x000080b4 address in question (no symbol for it). When this unnamed function conditionally returns with moveq pc, lr control is transferred back to __libc_init_array address 0x000093e8 which then determines that the array end has been reached and again just returns to its caller at 0x000093f0.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight