ARM NEON: comparing 128-bit values

I'm interested in finding the fastest way (lowest cycle count) of comparing the values stored in NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed).
So far I have the following:
(1) Using the VFP floating point comparison:
vcmp.f64 d0, d6
vmrs APSR_nzcv, fpscr
vcmpeq.f64 d1, d7
vmrseq APSR_nzcv, fpscr
If the 64-bit "floats" happen to hold NaN bit patterns, this version will not work (NaN never compares equal).
(2) Using the NEON narrowing and the VFP comparison (this time only once and in a NaN-safe manner):
vceq.i32 q15, q0, q3
vmovn.i32 d31, q15
vshl.s16 d31, d31, #8
vcmp.f64 d31, d29
vmrs APSR_nzcv, fpscr
The D29 register is preloaded beforehand with the right 16-bit pattern:
vmov.i16 d29, #65280 ; 0xff00
My question is: is there anything better than this? Am I overlooking some obvious way to do it?

I believe you can reduce it by one instruction. By using the shift left and insert (VSLI), you can combine the 4 32-bit values of Q15 into 4 16-bit values in D31. You can then compare with 0 and get the floating point flags.
vceq.i32 q15, q0, q3
vsli.32 d31, d30, #16
vcmp.f64 d31, #0
vmrs APSR_nzcv, fpscr
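For reference, the same NaN-safe idea can also be written with NEON intrinsics. This is only a rough sketch (the folding strategy is one of several options, and it is not tuned for cycle count like the assembly above):
#include <arm_neon.h>
#include <stdbool.h>

/* Bitwise comparison of two 128-bit NEON registers, NaN-safe. */
static inline bool neon_q_equal(uint32x4_t a, uint32x4_t b)
{
    uint32x4_t eq = vceqq_u32(a, b);                   /* 0xFFFFFFFF in each equal lane */
    uint32x2_t folded = vand_u32(vget_low_u32(eq), vget_high_u32(eq));
    folded = vand_u32(folded, vrev64_u32(folded));     /* fold the remaining two lanes together */
    return vget_lane_u32(folded, 0) == 0xFFFFFFFFu;
}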

Related

ARM-v8 NEON: is there an instruction to split a single normal register across multiple lanes of a NEON register?

I'm new to ARM-v8 (AArch64) and only did a little bit of NEON coding in ARM-v7 (but I'm very comfortable with A32 and ok(*) with normal A64).
Ultimately what I'm trying to do is count the frequency of each set bit [31:0] in a bunch (up to 15) of 32-bit values. I.e. in these 15 values, how many times is bit 0 set, how many times is bit 1 set, etc.
So, what I'd like to do is split the 32 bits over 32 nibbles in a 128 bit NEON register and then accumulate the NEON register, like this:
// args(x0: ptr to array of 16 32-bit words) ret(v0: sum of set bits as 32 nibbles)
mov w2, 16 // w2: loop counter
mov v0, 0 // v0: accumulate count
1:
ldr w1, [x0], 4
split v1, w1 // here some magic occurs
add v0.16b, v0.16b, v1.16b
subs w2, w2, 1
bne 1b
I'm not having much luck with the ARM documentation. The ARMv8 ARM just has an alphabetical listing of the 354 NEON instructions (800 pages of pseudocode). The ARMv8-A Programmer's Guide only has 14 pages of introduction and the enticing statement "New lane insert and extract instructions have been added to support the new register packing scheme." And the NEON Programmer's Guide is about ARM-v7.
Assuming there isn't a single instruction to do that, what would be the most efficient way of doing it? -- Not looking for a complete solution, but can NEON help at all? There wouldn't be much point if I have to load each lane separately...
(*) Can't say I like A64 though. :-(
You should think outside the box. Just because the source data is 32 bits wide doesn't mean you have to access it 32 bits at a time.
By reading it 4x8 bits at a time, the problem becomes much simpler. Below is code that splits and counts each of the 32 bits in the array:
/*
* alqCountBits.S
*
* Created on: 2020. 5. 26.
* Author: Jake 'Alquimista' LEE
*/
.arch armv8-a
.global alqCountBits
.text
// extern void alqCountBits(uint32_t *pDst, uint32_t *pSrc, uint32_t nLength);
// assert(nLength % 2 == 0);
pDst .req x0
pSrc .req x1
length .req w2
.balign 64
.func
alqCountBits:
adr x3, .LShiftTable
movi v30.16b, #1
ld1r {v31.2d}, [x3]
movi v0.16b, #0
movi v1.16b, #0
movi v2.16b, #0
movi v3.16b, #0
movi v4.16b, #0
movi v5.16b, #0
movi v6.16b, #0
movi v7.16b, #0
.balign 64
1:
ld4r {v16.8b, v17.8b, v18.8b, v19.8b}, [pSrc], #4
ld4r {v20.8b, v21.8b, v22.8b, v23.8b}, [pSrc], #4
subs length, length, #2
trn1 v24.2d, v16.2d, v17.2d
trn1 v25.2d, v18.2d, v19.2d
trn1 v26.2d, v20.2d, v21.2d
trn1 v27.2d, v22.2d, v23.2d
ushl v16.16b, v24.16b, v31.16b
ushl v17.16b, v25.16b, v31.16b
ushl v18.16b, v26.16b, v31.16b
ushl v19.16b, v27.16b, v31.16b
and v16.16b, v16.16b, v30.16b
and v17.16b, v17.16b, v30.16b
and v18.16b, v18.16b, v30.16b
and v19.16b, v19.16b, v30.16b
uaddl v24.8h, v18.8b, v16.8b
uaddl2 v25.8h, v18.16b, v16.16b
uaddl v26.8h, v19.8b, v17.8b
uaddl2 v27.8h, v19.16b, v17.16b
uaddw v0.4s, v0.4s, v24.4h
uaddw2 v1.4s, v1.4s, v24.8h
uaddw v2.4s, v2.4s, v25.4h
uaddw2 v3.4s, v3.4s, v25.8h
uaddw v4.4s, v4.4s, v26.4h
uaddw2 v5.4s, v5.4s, v26.8h
uaddw v6.4s, v6.4s, v27.4h
uaddw2 v7.4s, v7.4s, v27.8h
b.gt 1b
.balign 8
stp q0, q1, [pDst, #0]
stp q2, q3, [pDst, #32]
stp q4, q5, [pDst, #64]
stp q6, q7, [pDst, #96]
ret
.endfunc
.balign 8
.LShiftTable:
.dc.b 0, -1, -2, -3, -4, -5, -6, -7
.end
I don't like the aarch64 mnemonics either. For comparison I put the aarch32 version below:
/*
* alqCountBits.S
*
* Created on: 2020. 5. 26.
* Author: Jake 'Alquimista' LEE
*/
.syntax unified
.arm
.arch armv7-a
.fpu neon
.global alqCountBits
.text
// extern void alqCountBits(uint32_t *pDst, uint32_t *pSrc, uint32_t nLength);
// assert(nLength % 2 == 0);
pDst .req r0
pSrc .req r1
length .req r2
.balign 32
.func
alqCountBits:
adr r12, .LShiftTable
vpush {q4-q7}
vld1.64 {d30}, [r12]
vmov.i8 q14, #1
vmov.i8 q0, #0
vmov.i8 q1, #0
vmov.i8 q2, #0
vmov.i8 q3, #0
vmov.i8 q4, #0
vmov.i8 q5, #0
vmov.i8 q6, #0
vmov.i8 q7, #0
vmov d31, d30
.balign 32
1:
vld4.8 {d16[], d17[], d18[], d19[]}, [pSrc]!
vld4.8 {d20[], d21[], d22[], d23[]}, [pSrc]!
subs length, length, #2
vshl.u8 q8, q8, q15
vshl.u8 q9, q9, q15
vshl.u8 q10, q10, q15
vshl.u8 q11, q11, q15
vand q8, q8, q14
vand q9, q9, q14
vand q10, q10, q14
vand q11, q11, q14
vaddl.u8 q12, d20, d16
vaddl.u8 q13, d21, d17
vaddl.u8 q8, d22, d18
vaddl.u8 q10, d23, d19
vaddw.u16 q0, q0, d24
vaddw.u16 q1, q1, d25
vaddw.u16 q2, q2, d26
vaddw.u16 q3, q3, d27
vaddw.u16 q4, q4, d16
vaddw.u16 q5, q5, d17
vaddw.u16 q6, q6, d20
vaddw.u16 q7, q7, d21
bgt 1b
.balign 8
vst1.32 {q0, q1}, [pDst]!
vst1.32 {q2, q3}, [pDst]!
vst1.32 {q4, q5}, [pDst]!
vst1.32 {q6, q7}, [pDst]
vpop {q4-q7}
bx lr
.endfunc
.balign 8
.LShiftTable:
.dc.b 0, -1, -2, -3, -4, -5, -6, -7
.end
As you can see, the trn1 equivalent is not needed at all in aarch32.
Still, overall I much prefer aarch64, due to the sheer number of registers.
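For completeness, a rough C caller for the routine above (the prototype comes from the comment in the listing; the test values are made up, and the exact counter-to-bit ordering in pDst follows the accumulation pattern, so verify it against a known input):
#include <stdint.h>
#include <stdio.h>

extern void alqCountBits(uint32_t *pDst, uint32_t *pSrc, uint32_t nLength);

int main(void)
{
    uint32_t src[16] = { 0x1, 0x3, 0x7, 0x80000001 };  /* remaining entries are zero */
    uint32_t counts[32];

    alqCountBits(counts, src, 16);                     /* nLength must be even */
    for (int i = 0; i < 32; i++)
        printf("counter %2d = %u\n", i, (unsigned)counts[i]);
    return 0;
}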
I don't think it can be done per nibble, but per byte should work.
Load a vector with the relevant source bit set in each byte (you'll need two of these, as we can probably only do this per byte and not per nibble). Duplicate each byte of the word into 8 byte-sized elements each, in two vectors. Do a cmtst with both masks (which will set all bits in an element, i.e. set it to -1, if the corresponding bit was set), and accumulate.
Something like this, untested:
.section .rodata
mask: .byte 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
.text
mov w2, 16 // w2: loop counter
movi v0.16b, #0 // v0: accumulate count 1
movi v1.16b, #0 // v1: accumulate count 2
adrp x3, mask
add x3, x3, :lo12:mask
ld1 {v2.16b}, [x3] // v2: mask with one bit set in each byte
1:
ld1r {v3.4s}, [x0], #4 // One vector with the full 32 bit word
subs w2, w2, 1
dup v4.8b, v3.b[0] // v4: vector containing the lowest byte of the word
dup v5.8b, v3.b[1] // v5: vector containing the second lowest byte of the word
dup v6.8b, v3.b[2]
dup v7.8b, v3.b[3]
ins v4.d[1], v5.d[0] // v4: elements 0-7: lowest byte, elements 8-15: second byte
ins v6.d[1], v7.d[0] // v6: elements 0-7: third byte, elements 8-15: fourth byte
cmtst v4.16b, v4.16b, v2.16b // v4: each byte -1 if the corresponding bit was set
cmtst v6.16b, v6.16b, v2.16b // v6: each byte -1 if the corresponding bit was set
sub v0.16b, v0.16b, v4.16b // accumulate: if bit was set, subtract -1 i.e. add +1
sub v1.16b, v1.16b, v6.16b
b.ne 1b
// Done, count of individual bits in byte sized elements in v0-v1
EDIT: The ld4r approach as suggested by Jake 'Alquimista' LEE is actually better than the loading here; the ld1r followed by four dup could be replaced by ld4r {v4.8b, v5.8b, v6.8b, v7.8b}, [x0], #4 here, keeping the logic the same. For the rest, whether cmtst or ushl + and ends up faster, one would have to test and measure to see. And handling two 32-bit words at the same time, as in his solution, probably gives better throughput than my solution here.
Combining the above answers, and modifying my requirements ;-) I came up with:
tst:
ldr x0, =test_data
ldr x1, =mask
ld1 {v2.2d}, [x1] // ld1.2d v2, [x1] // load 2 * 64 = 128 bits
movi v0.16b, 0
mov w2, 8
1:
ld1r {v1.8h}, [x0], 2 // ld1r.8h v1, [x0], 2 // repeat one 16-bit word across eight 16-bit lanes
cmtst v1.16b, v1.16b, v2.16b // cmtst.16b v1, v1, v2 // sets -1 in each 8bit word of 16 8-bit lanes if input matches mask
sub v0.16b, v0.16b, v1.16b // sub.16b v0, v0, v1 // sub -1 = add +1
subs w2, w2, 1
bne 1b
// v0 contains 16 bytes, mildly shuffled.
If one wants them unshuffled:
mov v1.d[0], v0.d[1]
uzp1 v2.8b, v0.8b, v1.8b
uzp2 v3.8b, v0.8b, v1.8b
mov v2.d[1], v3.d[0]
// v2 contains 16 bytes, in order.
The following counts up to fifteen samples with 32 bits (accumulating in 32 nibbles):
tst2:
ldr x0, =test_data2
ldr x1, =mask2
ld1 {v2.4s, v3.4s, v4.4s, v5.4s}, [x1] // ld1.4s {v2, v3, v4, v5}, [x1]
movi v0.16b, 0
mov w2, 8
1:
ld1r {v1.4s}, [x0], 4 // ld1r.4s v1, [x0], 4 // repeat one 32-bit word across four 32-bit lanes
cmtst v6.16b, v1.16b, v2.16b // cmtst.16b v6, v1, v2 // upper nibbles
cmtst v1.16b, v1.16b, v3.16b // cmtst.16b v1, v1, v3 // lower nibbles
and v6.16b, v6.16b, v4.16b // and.16b v6, v6, v4 // upper inc 0001.0000 x 16
and v1.16b, v1.16b, v5.16b // and.16b v1, v1, v5 // lower inc 0000.0001 x 16
orr v1.16b, v1.16b, v6.16b // orr.16b v1, v1, v6
add v0.16b, v0.16b, v1.16b // add.16b v0, v0, v1 // accumulate
subs w2, w2, 1
bne 1b
// v0 contains 32 nibbles -- somewhat shuffled, but that's ok.
// fedcba98.76543210.fedcba98.76543210.fedcba98.76543210.fedcba98.76543210 fedcba98.76543210.fedcba98.76543210.fedcba98.76543210.fedcba98.76543210
// 10000000.10000000.01000000.01000000.00100000.00100000.00010000.00010000 00001000.00001000.00000100.00000100.00000010.00000010.00000001.00000001
// f 7 e 6 d 5 c 4 b 3 a 2 9 1 8 0
mask:
.quad 0x0808040402020101
.quad 0x8080404020201010
test_data:
.hword 0x0103
.hword 0x0302
.hword 0x0506
.hword 0x080A
.hword 0x1010
.hword 0x2020
.hword 0xc040
.hword 0x8080
// FEDCBA98.76543210.fedcba⁹⁸.⁷⁶⁵⁴³²¹⁰.FEDCBA98.76543210.fedcba⁹⁸.⁷⁶⁵⁴³²¹⁰.FEDCBA98.76543210.fedcba⁹⁸.⁷⁶⁵⁴³²¹⁰.FEDCBA98.76543210.fedcba⁹⁸.⁷⁶⁵⁴³²¹⁰
// 10001000 10001000 10001000 10001000 01000100 01000100 01000100 01000100 00100010 00100010 00100010 00100010 00010001 00010001 00010001 00010001
// F B 7 3 f b ⁷ ³ E A 6 2 e a ⁶ ² D 9 5 1 d ⁹ ⁵ ¹ C 8 4 0 c ⁸ ⁴ ⁰
mask2:
.quad 0x8080808040404040 // v2
.quad 0x2020202010101010
.quad 0x0808080804040404 // v3
.quad 0x0202020201010101
.quad 0x1010101010101010 // v4
.quad 0x1010101010101010
.quad 0x0101010101010101 // v5
.quad 0x0101010101010101
test_data2:
.word 0xff000103
.word 0xff000302
.word 0xff000506
.word 0xff00080A
.word 0xff001010
.word 0xff002020
.word 0xff00c040
.word 0xff008080

How to understand why an ARM exception happens?

I'm trying to understand the reason for an ARM exception that I encounter.
It happens randomly during system startup, and can look a few different ways.
One of the simplest is the following:
0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
#0 0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
#1 0x80002f04 in ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm(int0_t) ()
at /home/rnd_share/sysbios/bios_6_51_00_15/packages/ti/sysbios/family/arm/exc/Exception_asm_gnu.asm:103
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
r0 0x20000197 536871319
r1 0x20000197 536871319
r2 0x20000197 536871319
r3 0x20000197 536871319
r4 0x20000197 536871319
r5 0x6 6
r6 0x80000024 2147483684
r7 0x80007a0c 2147514892
r8 0x8004f0a8 2147807400
r9 0x80041340 2147750720
r10 0x80040a3c 2147748412
r11 0xffffffff 4294967295
r12 0x20000197 536871319
sp 0x7fffff88 0x7fffff88
lr 0x80002f04 2147495684
pc 0x8004e810 0x8004e810 <ti_sysbios_family_arm_a8_intcps_Hwi_vectors+16>
cpsr 0x20000197 536871319
PC = 8004E810, CPSR = 20000197 (ABORT mode, ARM IRQ dis.)
R0 = 20000197, R1 = 20000197, R2 = 20000197, R3 = 20000197
R4 = 20000197, R5 = 00000006, R6 = 80000024, R7 = 80007A0C
USR: R8 =8004F0A8, R9 =80041340, R10=80040A3C, R11 =FFFFFFFF, R12 =20000197
R13=80212590, R14=80040A3C
FIQ: R8 =AEE1D6FA, R9 =C07BA930, R10=1B0B137A, R11 =7EC3F1DF, R12 =2000019F
R13=80065CF8, R14=00000000, SPSR=00000000
SVC: R13=4030CB20, R14=00022071, SPSR=00000000
ABT: R13=7FFFFF88, R14=80002F04, SPSR=20000197
IRQ: R13=F4ADFD8A, R14=80041020, SPSR=8000011F
UND: R13=80085CF8, R14=ED0F7EF1, SPSR=00000000
(gdb) frame
#0 0x8004e810 in ti_sysbios_family_arm_a8_intcps_Hwi_vectors ()
(gdb) frame 1
#1 0x80002f04 in ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm(int0_t) ()
at /home/rnd_share/sysbios/bios_6_51_00_15/packages/ti/sysbios/family/arm/exc/Exception_asm_gnu.asm:103
103 mrc p15, #0, r12, c5, c0, #0 # read DFSR into r12
(gdb) list
98 .func ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm__I
99
100 ti_sysbios_family_arm_exc_Exception_excHandlerDataAsm__I:
101 stmfd sp!, {r0-r12} # save r4-r12 while we're at it
102
103 mrc p15, #0, r12, c5, c0, #0 # read DFSR into r12
104 stmfd sp!, {r12} # save DFSR
105 mrc p15, #0, r12, c5, c0, #1 # read IFSR into r12
106 stmfd sp!, {r12} # save DFSR
107 mrc p15, #0, r12, c6, c0, #0 # read DFAR into r12
(gdb) monitor cp15 6 0 0 0
Reading CP15 register (6,0,0,0 = 0x7FFFFF54)
My understanding is that there was some ongoing exception, which can be seen in frame 1.
It tries to save registers onto the stack:
101 stmfd sp!, {r0-r12} # save r4-r12 while we're at it
But the stack pointer was incorrect:
ABT: R13=7FFFFF88
I don't understand two things:
What can cause such a value of SP in the ABT and IRQ contexts?
What is actually in frame 0? In other words, how did the Cortex react to a data abort while already being in an exception handler?
The device usually starts normally; this situation happens roughly 3 times per 10 boots. It never happens when starting from the debugger, only in release builds and only when started from the bootloader.
Two weeks later...
The boot procedure is the following:
1. The 2nd-stage bootloader loads the application into memory.
2. The 2nd-stage bootloader jumps to the application start.
3. The main function of the application is entered.
It turns out that sometimes the statically initialized values of the application are correct after step 1 of booting, but by step 3 they are corrupted. In other words, the application image is corrupted.
The caches hadn't been flushed correctly between steps 1 and 2.
Disabling the caches in the 2nd-stage bootloader fixed the problem entirely.
Now it needs to be fixed properly.
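For the record, the kind of maintenance the 2nd-stage bootloader needs is roughly the following (an ARMv7-A sketch of mine, not code from the actual bootloader; the 64-byte line size and privileged execution are assumptions):
#include <stdint.h>

/* Clean D-cache lines covering [start, end) to the point of coherency (DCCMVAC). */
static void clean_dcache_range(uint32_t start, uint32_t end)
{
    uint32_t line;
    for (line = start & ~63u; line < end; line += 64)
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(line) : "memory");
    __asm__ volatile("dsb" ::: "memory");
}

/* Invalidate the whole instruction cache (ICIALLU) and resynchronize. */
static void invalidate_icache_all(void)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 0" :: "r"(0) : "memory");
    __asm__ volatile("dsb" ::: "memory");
    __asm__ volatile("isb" ::: "memory");
}

/* Before jumping to the application entry point:
 *   clean_dcache_range(image_start, image_end);
 *   invalidate_icache_all();
 */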

Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)

I'm a newbie at instruction optimization.
I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays.
The C code is as follows:
float dotp(
    const float x[],
    const float y[],
    const short n
)
{
    short i;
    float suma;
    suma = 0.0f;
    for (i = 0; i < n; i++)
    {
        suma += x[i] * y[i];
    }
    return suma;
}
I use the test framework testp provided by Agner Fog on his website.
The arrays which are used in this case are aligned:
int n = 2048;
float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
char *mem = (char*)_mm_malloc(1<<18,4096);
char *a = mem;
char *b = a+n*sizeof(float);
char *c = b+n*sizeof(float);
float *x = (float*)a;
float *y = (float*)b;
float *z = (float*)c;
Then I call the function dotp, n=2048, repeat=100000:
for (i = 0; i < repeat; i++)
{
    sum = dotp(x,y,n);
}
I compile it with gcc 4.8.3, with the compile option -O3.
I compile this application on a computer which does not support FMA instructions, so you can see there are only SSE instructions.
The assembly code:
.L13:
movss xmm1, DWORD PTR [rdi+rax*4]
mulss xmm1, DWORD PTR [rsi+rax*4]
add rax, 1
cmp cx, ax
addss xmm0, xmm1
jg .L13
I do some analysis:
        μops-fused   la      0      1      2      3      4      5      6      7
movss       1         3                  0.5    0.5
mulss       1         5    0.5    0.5    0.5    0.5
add         1         1    0.25   0.25                        0.25   0.25
cmp         1         1    0.25   0.25                        0.25   0.25
addss       1         3           1
jg          1         1                                               1
-----------------------------------------------------------------------------
total       6         5    1      2      1      1             0.5    1.5
After running, we get the result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
--------------------------------------------------------------------
542177906 |609942404 |1230100389 |205000027 |261069369 |205511063
--------------------------------------------------------------------
2.64 | 2.97 | 6.00 | 1 | 1.27 | 1.00
uop p2 | uop p3 | uop p4 | uop p5 | uop p6 | uop p7
-----------------------------------------------------------------------
205185258 | 205188997 | 100833 | 245370353 | 313581694 | 844
-----------------------------------------------------------------------
1.00 | 1.00 | 0.00 | 1.19 | 1.52 | 0.00
The second line is the value read from the Intel registers; the third line is divided by the branch number, "BrTaken".
So we can see that in the loop there are 6 instructions and 7 uops, in agreement with the analysis.
The numbers of uops run on port 0, port 1, port 5 and port 6 are similar to what the analysis says. I think the uop scheduler may be trying to balance the load across the ports, am I right?
I absolutely do not understand why there are only about 3 cycles per loop iteration. According to Agner's instruction tables, the latency of the mulss instruction is 5, and there are dependencies between loop iterations, so as far as I can see it should take at least 5 cycles per iteration.
Could anyone shed some light on this?
==================================================================
I tried to write an optimized version of this function in nasm, unrolling the loop by a factor of 8 and using the vfmadd231ps instruction:
.L2:
vmovaps ymm1, [rdi+rax]
vfmadd231ps ymm0, ymm1, [rsi+rax]
vmovaps ymm2, [rdi+rax+32]
vfmadd231ps ymm3, ymm2, [rsi+rax+32]
vmovaps ymm4, [rdi+rax+64]
vfmadd231ps ymm5, ymm4, [rsi+rax+64]
vmovaps ymm6, [rdi+rax+96]
vfmadd231ps ymm7, ymm6, [rsi+rax+96]
vmovaps ymm8, [rdi+rax+128]
vfmadd231ps ymm9, ymm8, [rsi+rax+128]
vmovaps ymm10, [rdi+rax+160]
vfmadd231ps ymm11, ymm10, [rsi+rax+160]
vmovaps ymm12, [rdi+rax+192]
vfmadd231ps ymm13, ymm12, [rsi+rax+192]
vmovaps ymm14, [rdi+rax+224]
vfmadd231ps ymm15, ymm14, [rsi+rax+224]
add rax, 256
jne .L2
The result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
------------------------------------------------------------------------
24371315 | 27477805| 59400061 | 3200001 | 14679543 | 11011601
------------------------------------------------------------------------
7.62 | 8.59 | 18.56 | 1 | 4.59 | 3.44
uop p2 | uop p3 | uop p4 | uop p5 | uop p6 | uop p7
-------------------------------------------------------------------------
25960380 |26000252 | 47 | 537 | 3301043 | 10
------------------------------------------------------------------------------
8.11 |8.13 | 0.00 | 0.00 | 1.03 | 0.00
So we can see the L1 data cache bandwidth reaches 2*256 bits per 8.59 cycles, which is very near the peak of 2*256 per 8; the usage is about 93%. The FMA unit only issued 8 FMAs per 8.59 cycles; the peak is 2*8 per 8, so the usage is 47%.
So I think I've reached the L1D bottleneck as Peter Cordes expects.
==================================================================
Special thanks to Boann for fixing so many grammatical errors in my question.
=================================================================
From Peter's reply, I understand that only "read and written" registers form a dependency; "write-only" registers do not.
So I try to reduce the number of registers used in the loop, and I try unrolling by 5; if everything is OK, I should hit the same bottleneck, L1D.
.L2:
vmovaps ymm0, [rdi+rax]
vfmadd231ps ymm1, ymm0, [rsi+rax]
vmovaps ymm0, [rdi+rax+32]
vfmadd231ps ymm2, ymm0, [rsi+rax+32]
vmovaps ymm0, [rdi+rax+64]
vfmadd231ps ymm3, ymm0, [rsi+rax+64]
vmovaps ymm0, [rdi+rax+96]
vfmadd231ps ymm4, ymm0, [rsi+rax+96]
vmovaps ymm0, [rdi+rax+128]
vfmadd231ps ymm5, ymm0, [rsi+rax+128]
add rax, 160 ;n = n+32
jne .L2
The result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
------------------------------------------------------------------------
25332590 | 28547345 | 63700051 | 5100001 | 14951738 | 10549694
------------------------------------------------------------------------
4.97 | 5.60 | 12.49 | 1 | 2.93 | 2.07
uop p2 |uop p3 | uop p4 | uop p5 |uop p6 | uop p7
------------------------------------------------------------------------------
25900132 |25900132 | 50 | 683 | 5400909 | 9
-------------------------------------------------------------------------------
5.08 |5.08 | 0.00 | 0.00 |1.06 | 0.00
We can see 5/5.60 = 89.45%, which is a little lower than unrolling by 8. Is there something wrong?
=================================================================
I tried unrolling the loop by 6, 7 and 15, to see the results.
I also unrolled by 5 and 8 again, to double-check the results.
The results are as follows; we can see that this time they are much better than before.
Although the results are not stable, a bigger unrolling factor generally gives a better result.
| L1D bandwidth | CodeMiss | L1D Miss | L2 Miss
----------------------------------------------------------------------------
unroll5 | 91.86% ~ 91.94% | 3~33 | 272~888 | 17~223
--------------------------------------------------------------------------
unroll6 | 92.93% ~ 93.00% | 4~30 | 481~1432 | 26~213
--------------------------------------------------------------------------
unroll7 | 92.29% ~ 92.65% | 5~28 | 336~1736 | 14~257
--------------------------------------------------------------------------
unroll8 | 95.10% ~ 97.68% | 4~23 | 363~780 | 42~132
--------------------------------------------------------------------------
unroll15 | 97.95% ~ 98.16% | 5~28 | 651~1295 | 29~68
=====================================================================
I tried compiling the function with gcc 7.1 on "https://gcc.godbolt.org".
The compile options are "-O3 -march=haswell -mtune=intel", similar to gcc 4.8.3.
.L3:
vmovss xmm1, DWORD PTR [rdi+rax]
vfmadd231ss xmm0, xmm1, DWORD PTR [rsi+rax]
add rax, 4
cmp rdx, rax
jne .L3
ret
Related:
AVX2: Computing dot product of 512 float arrays has a good manually-vectorized dot-product loop using multiple accumulators with FMA intrinsics. The rest of the answer explains why that's a good thing, with cpu-architecture / asm details.
Dot Product of Vectors with SIMD shows that with the right compiler options, some compilers will auto-vectorize that way.
Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell is another version of this Q&A with more focus on unrolling to hide latency (and bottleneck on throughput), and less background on what that even means. And with examples using C intrinsics.
Latency bounds and throughput bounds for processors for operations that must occur in sequence - a textbook exercise on dependency chains, with two interlocking chains, one reading from earlier in the other.
Look at your loop again: movss xmm1, src has no dependency on the old value of xmm1, because its destination is write-only. Each iteration's mulss is independent. Out-of-order execution can and does exploit that instruction-level parallelism, so you definitely don't bottleneck on mulss latency.
Optional reading: In computer architecture terms: register renaming avoids the WAR anti-dependency data hazard of reusing the same architectural register. (Some pipelining + dependency-tracking schemes before register renaming didn't solve all the problems, so the field of computer architecture makes a big deal out of different kinds of data hazards.)
Register renaming with Tomasulo's algorithm makes everything go away except the actual true dependencies (read after write), so any instruction where the destination is not also a source register has no interaction with the dependency chain involving the old value of that register. (Except for false dependencies, like popcnt on Intel CPUs, and writing only part of a register without clearing the rest (like mov al, 5 or sqrtss xmm2, xmm1). Related: Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?).
Back to your code:
.L13:
movss xmm1, DWORD PTR [rdi+rax*4]
mulss xmm1, DWORD PTR [rsi+rax*4]
add rax, 1
cmp cx, ax
addss xmm0, xmm1
jg .L13
The loop-carried dependencies (from one iteration to the next) are each:
xmm0, read and written by addss xmm0, xmm1, which has 3 cycle latency on Haswell.
rax, read and written by add rax, 1. 1c latency, so it's not the critical-path.
It looks like you measured the execution time / cycle-count correctly, because the loop bottlenecks on the 3c addss latency.
This is expected: the serial dependency in a dot product is the addition into a single sum (aka the reduction), not the multiplies between vector elements. (Unrolling with multiple sum accumulator variables / registers can hide that latency.)
That is by far the dominant bottleneck for this loop, despite various minor inefficiencies:
short i produced the silly cmp cx, ax, which takes an extra operand-size prefix. Luckily, gcc managed to avoid actually doing add ax, 1, because signed-overflow is Undefined Behaviour in C. So the optimizer can assume it doesn't happen. (update: integer promotion rules make it different for short, so UB doesn't come into it, but gcc can still legally optimize. Pretty wacky stuff.)
If you'd compiled with -mtune=intel, or better, -march=haswell, gcc would have put the cmp and jg next to each other where they could macro-fuse.
I'm not sure why you have a * in your table on the cmp and add instructions. (update: I was purely guessing that you were using a notation like IACA does, but apparently you weren't). Neither of them fuse. The only fusion happening is micro-fusion of mulss xmm1, [rsi+rax*4].
And since it's a 2-operand ALU instruction with a read-modify-write destination register, it stays micro-fused even in the ROB on Haswell. (Sandybridge would un-laminate it at issue time.) Note that vmulss xmm1, xmm1, [rsi+rax*4] would un-laminate on Haswell, too.
None of this really matters, since you just totally bottleneck on FP-add latency, much slower than any uop-throughput limits. Without -ffast-math, there's nothing compilers can do. With -ffast-math, clang will usually unroll with multiple accumulators, and it will auto-vectorize so they will be vector accumulators. So you can probably saturate Haswell's throughput limit of 1 vector or scalar FP add per clock, if you hit in L1D cache.
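In source terms, that transformation looks roughly like the following sketch (what -ffast-math permits a compiler to do, shown here with scalar accumulators; not the exact code any particular compiler emits):
float dotp_unrolled(const float *x, const float *y, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {      /* four independent addss dependency chains */
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)                     /* scalar cleanup for leftover elements */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}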
With FMA being 5c latency and 0.5c throughput on Haswell, you would need 10 accumulators to keep 10 FMAs in flight and max out FMA throughput by keeping p0/p1 saturated with FMAs. (Skylake reduced FMA latency to 4 cycles, and runs multiply, add, and FMA on the FMA units. So it actually has higher add latency than Haswell.)
(You're bottlenecked on loads, because you need two loads for every FMA. In other cases, you can actually gain add throughput by replacing a vaddps instruction with an FMA with a multiplier of 1.0. This means more latency to hide, so it's best in a more complex algorithm where you have an add that's not on the critical path in the first place.)
Re: uops per port:
there are 1.19 uops per loop on port 5, much more than the expected 0.5. Is it a matter of the uop dispatcher trying to put the same number of uops on every port?
Yes, something like that.
The uops are not assigned randomly, or somehow evenly distributed across every port they could run on. You assumed that the add and cmp uops would distribute evenly across p0156, but that's not the case.
The issue stage assigns uops to ports based on how many uops are already waiting for that port. Since addss can only run on p1 (and it's the loop bottleneck), there are usually a lot of p1 uops issued but not executed. So few other uops will ever be scheduled to port1. (This includes mulss: most of the mulss uops will end up scheduled to port 0.)
Taken-branches can only run on port 6. Port 5 doesn't have any uops in this loop that can only run there, so it ends up attracting a lot of the many-port uops.
The scheduler (which picks unfused-domain uops out of the Reservation Station) isn't smart enough to run critical-path-first, so this assignment algorithm reduces resource-conflict latency (other uops stealing port1 on cycles when an addss could have run). It's also useful in cases where you bottleneck on the throughput of a given port.
Scheduling of already-assigned uops is normally oldest-ready first, as I understand it. This simple algorithm is hardly surprising, since it has to pick a uop with its inputs ready for each port from a 60-entry RS every clock cycle, without melting your CPU. The out-of-order machinery that finds and exploits the ILP is one of the significant power costs in a modern CPU, comparable to the execution units that do the actual work.
Related / more details: How are x86 uops scheduled, exactly?
More performance analysis stuff:
Other than cache misses / branch mispredicts, the three main possible bottlenecks for CPU-bound loops are:
dependency chains (like in this case)
front-end throughput (max of 4 fused-domain uops issued per clock on Haswell)
execution port bottlenecks, like if lots of uops need p0/p1, or p2/p3, like in your unrolled loop. Count unfused-domain uops for specific ports. Generally you can assume a best-case distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen sometimes.
A loop body or short block of code can be approximately characterized by 3 things: fused-domain uop count, unfused-domain count of which execution units it can run on, and total critical-path latency assuming best-case scheduling for its critical path. (Or latencies from each of input A/B/C to the output...)
For example of doing all three to compare a few short sequences, see my answer on What is the efficient way to count set bits at a position or lower?
For short loops, modern CPUs have enough out-of-order execution resources (physical register file size so renaming doesn't run out of registers, ROB size) to have enough iterations of a loop in-flight to find all the parallelism. But as dependency chains within loops get longer, eventually they run out. See Measuring Reorder Buffer Capacity for some details on what happens when a CPU runs out of registers to rename onto.
See also lots of performance and reference links in the x86 tag wiki.
Tuning your FMA loop:
Yes, dot-product on Haswell will bottleneck on L1D throughput at only half the throughput of the FMA units, since it takes two loads per multiply+add.
If you were doing B[i] = x * A[i] + y; or sum(A[i]^2), you could saturate FMA throughput.
It looks like you're still trying to avoid register reuse even in write-only cases like the destination of a vmovaps load, so you ran out of registers after unrolling by 8. That's fine, but could matter for other cases.
Also, using ymm8-15 can slightly increase code-size if it means a 3-byte VEX prefix is needed instead of 2-byte. Fun fact: vpxor ymm7,ymm7,ymm8 needs a 3-byte VEX while vpxor ymm8,ymm8,ymm7 only needs a 2-byte VEX prefix. For commutative ops, sort source regs from high to low.
Our load bottleneck means the best-case FMA throughput is half the max, so we need at least 5 vector accumulators to hide their latency. 8 is good, so there's plenty of slack in the dependency chains to let them catch up after any delays from unexpected latency or competition for p0/p1. 7 or maybe even 6 would be fine, too: your unroll factor doesn't have to be a power of 2.
Unrolling by exactly 5 would mean that you're also right at the bottleneck for dependency chains. Any time an FMA doesn't run in the exact cycle its input is ready means a lost cycle in that dependency chain. This can happen if a load is slow (e.g. it misses in L1 cache and has to wait for L2), or if loads complete out of order and an FMA from another dependency chain steals the port this FMA was scheduled for. (Remember that scheduling happens at issue time, so the uops sitting in the scheduler are either port0 FMA or port1 FMA, not an FMA that can take whichever port is idle).
If you leave some slack in the dependency chains, out-of-order execution can "catch up" on the FMAs, because they won't be bottlenecked on throughput or latency, just waiting for load results. #Forward found (in an update to the question) that unrolling by 5 reduced performance from 93% of L1D throughput to 89.5% for this loop.
My guess is that unroll by 6 (one more than the minimum to hide the latency) would be ok here, and get about the same performance as unroll by 8. If we were closer to maxing out FMA throughput (rather than just bottlenecked on load throughput), one more than the minimum might not be enough.
update: #Forward's experimental test shows my guess was wrong. There isn't a big difference between unroll5 and unroll6. Also, unroll15 is twice as close as unroll8 to the theoretical max throughput of 2x 256b loads per clock. Measuring with just independent loads in the loop, or with independent loads and register-only FMA, would tell us how much of that is due to interaction with the FMA dependency chain. Even the best case won't get perfect 100% throughput, if only because of measurement errors and disruption due to timer interrupts. (Linux perf measures only user-space cycles unless you run it as root, but time still includes time spent in interrupt handlers. This is why your CPU frequency might be reported as 3.87GHz when run as non-root, but 3.900GHz when run as root and measuring cycles instead of cycles:u.)
We aren't bottlenecked on front-end throughput, but we can reduce the fused-domain uop count by avoiding indexed addressing modes for non-mov instructions. Fewer is better and makes this more hyperthreading-friendly when sharing a core with something other than this.
The simple way is just to do two pointer-increments inside the loop. The complicated way is a neat trick of indexing one array relative to the other:
;; input pointers for x[] and y[] in rdi and rsi
;; size_t n in rdx
;;; zero ymm1..8, or load+vmulps into them
add rdx, rsi ; end_y
; lea rdx, [rdx+rsi-252] to break out of the unrolled loop before going off the end, with odd n
sub rdi, rsi ; index x[] relative to y[], saving one pointer increment
.unroll8:
vmovaps ymm0, [rdi+rsi] ; *px, actually py[xy_offset]
vfmadd231ps ymm1, ymm0, [rsi] ; *py
vmovaps ymm0, [rdi+rsi+32] ; write-only reuse of ymm0
vfmadd231ps ymm2, ymm0, [rsi+32]
vmovaps ymm0, [rdi+rsi+64]
vfmadd231ps ymm3, ymm0, [rsi+64]
vmovaps ymm0, [rdi+rsi+96]
vfmadd231ps ymm4, ymm0, [rsi+96]
add rsi, 256 ; pointer-increment here
; so the following instructions can still use disp8 in their addressing modes: [-128 .. +127] instead of disp32
; smaller code-size helps in the big picture, but not for a micro-benchmark
vmovaps ymm0, [rdi+rsi+128-256] ; be pedantic in the source about compensating for the pointer-increment
vfmadd231ps ymm5, ymm0, [rsi+128-256]
vmovaps ymm0, [rdi+rsi+160-256]
vfmadd231ps ymm6, ymm0, [rsi+160-256]
vmovaps ymm0, [rdi+rsi-64] ; or not
vfmadd231ps ymm7, ymm0, [rsi-64]
vmovaps ymm0, [rdi+rsi-32]
vfmadd231ps ymm8, ymm0, [rsi-32]
cmp rsi, rdx
jb .unroll8 ; } while(py < endy);
Using a non-indexed addressing mode as the memory operand for vfmaddps lets it stay micro-fused in the out-of-order core, instead of being un-laminated at issue. Micro fusion and addressing modes
So my loop is 18 fused-domain uops for 8 vectors. Yours takes 3 fused-domain uops for each vmovaps + vfmaddps pair, instead of 2, because of un-lamination of indexed addressing modes. Both of them still of course have 2 unfused-domain load uops (port2/3) per pair, so that's still the bottleneck.
Fewer fused-domain uops lets out-of-order execution see more iterations ahead, potentially helping it absorb cache misses better. It's a minor thing when we're bottlenecked on an execution unit (load uops in this case) even with no cache misses, though. But with hyperthreading, you only get every other cycle of front-end issue bandwidth unless the other thread is stalled. If it's not competing too much for load and p0/1, fewer fused-domain uops will let this loop run faster while sharing a core. (e.g. maybe the other hyper-thread is running a lot of port5 / port6 and store uops?)
Since un-lamination happens after the uop-cache, your version doesn't take extra space in the uop cache. A disp32 with each uop is ok, and doesn't take extra space. But bulkier code-size means the uop-cache is less likely to pack as efficiently, since you'll hit 32B boundaries before uop cache lines are full more often. (Actually, smaller code doesn't guarantee better either. Smaller instructions could lead to filling a uop cache line and needing one entry in another line before crossing a 32B boundary.) This small loop can run from the loopback buffer (LSD), so fortunately the uop-cache isn't a factor.
Then after the loop: Efficient cleanup is the hard part of efficient vectorization for small arrays that might not be a multiple of the unroll factor or especially the vector width.
...
jb
;; If `n` might not be a multiple of 4x 8 floats, put cleanup code here
;; to do the last few ymm or xmm vectors, then scalar or an unaligned last vector + mask.
; reduce down to a single vector, with a tree of dependencies
vaddps ymm1, ymm2, ymm1
vaddps ymm3, ymm4, ymm3
vaddps ymm5, ymm6, ymm5
vaddps ymm7, ymm8, ymm7
vaddps ymm0, ymm3, ymm1
vaddps ymm1, ymm7, ymm5
vaddps ymm0, ymm1, ymm0
; horizontal within that vector, low_half += high_half until we're down to 1
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vmovhlps xmm1, xmm0, xmm0
vaddps xmm0, xmm0, xmm1
vmovshdup xmm1, xmm0
vaddss xmm0, xmm1
; this is faster than 2x vhaddps
vzeroupper ; important if returning to non-AVX-aware code after using ymm regs.
ret ; with the scalar result in xmm0
For more about the horizontal sum at the end, see Fastest way to do horizontal SSE vector sum (or other reduction). The two 128b shuffles I used don't even need an immediate control byte, so it saves 2 bytes of code size vs. the more obvious shufps. (And 4 bytes of code-size vs. vpermilps, because that opcode always needs a 3-byte VEX prefix as well as an immediate). AVX 3-operand stuff is very nice compared to SSE, especially when writing in C with intrinsics so you can't as easily pick a cold register to movhlps into.

MSP430 microcontroller - how to check addressing modes

I'm programming an MSP430 in C, as a simulation of the real microcontroller. I got stuck on the addressing modes (https://en.wikipedia.org/wiki/TI_MSP430#MSP430_CPU), especially:
Addressing modes using R0 (PC)
Addressing modes using R2 (SR) and R3 (CG), special-case decoding
I don't understand what 0 (PC), 2 (SR) and 3 (CG) mean. What are they?
How do I check for these values?
so for the source: if the as bits are 01 and the source register bits are 0 (which is the pc, for reference), then
ADDR Symbolic. Equivalent to x(PC). The operand is in memory at address PC+x.
if the ad bit is 1 and the destination is 0, then also
ADDR Symbolic. Equivalent to x(PC). The operand is in memory at address PC+x.
x is going to be another word that follows this instruction, so the cpu will fetch the next word, add it to the pc, and that is the source.
if the as bits are 11 and the source is register 0, the source is an immediate value, which is in the next word after the instruction.
if the as bits are 01 and the source is a 2 (which happens to be the SR register, for reference), then the address is x, the next word after the instruction (&ADDR).
if the ad bit is 1 and the destination register is a 2, then it is also an &ADDR.
if the as bits are 10 and the source bits are a 2, then the source is the constant value 4 and we don't have to burn a word in flash after the instruction for that 4.
it doesn't make sense to have a destination be a constant 4, so that isn't a real combination.
repeat for the rest of the table.
you can have both of these addressing modes at the same time
mov #0x5A80,&0x0120
generates
c000: b2 40 80 5a mov #23168, &0x0120 ;#0x5a80
c004: 20 01
which is
0x40b2 0x5a80 0x0120
0100000010110010
0100 opcode mov
0000 source
1 ad
0 b/w
11 as
0010 destination
so we have an as of 11 with a source of 0, the immediate #x, and an ad of 1 with a destination of 2, so the destination is &ADDR. this is an important experiment, because when you have 2 x values (basically a three-word instruction) it shows which one goes with the source and which with the destination
0x40b2 0x5a80 0x0120
so the immediate 0x5a80, which is the source, is the first x to follow the instruction, and then the destination address 0x0120 comes after that.
if it were just an immediate and a register then
c006: 31 40 ff 03 mov #1023, r1 ;#0x03ff
0x4031 0x03FF
0100000000110001
0100 mov
0000 source
0 ad
0 b/w
11 as
0001 dest
an as of 11 and a source of 0 is #immediate; the x is 0x03FF in this case, the word that follows. the destination has an ad of 0:
Register direct. The operand is the contents of Rn
where the destination in this case is r1.
so the first group Rn, x(Rn), @Rn and @Rn+ are the normal cases; the ones below that, which you are asking about, are special cases. if you get a combination that fits a special case then you do that, otherwise you do the normal case, like the mov immediate to r1 example above. the destination of r1 was a normal Rn case.
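Since the question is about writing a simulator in C, here is a rough sketch (the names are mine, purely for illustration) of pulling those fields out of a format-I / double-operand instruction word; feeding it 0x40b2 from the example above gives opcode 4 (MOV), src 0, ad 1, b/w 0, as 11 and dst 2:
#include <stdint.h>

typedef struct {
    unsigned opcode;  /* bits 15..12 */
    unsigned src;     /* bits 11..8  */
    unsigned ad;      /* bit 7       */
    unsigned bw;      /* bit 6       */
    unsigned as;      /* bits 5..4   */
    unsigned dst;     /* bits 3..0   */
} msp430_fmt1;

static msp430_fmt1 decode_fmt1(uint16_t iw)
{
    msp430_fmt1 f;
    f.opcode = (iw >> 12) & 0xF;
    f.src    = (iw >>  8) & 0xF;
    f.ad     = (iw >>  7) & 1;
    f.bw     = (iw >>  6) & 1;
    f.as     = (iw >>  4) & 3;
    f.dst    =  iw        & 0xF;
    return f;
}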
As=01, Ad=1, R0 (ADDR): This is exactly the same as x(Rn), i.e., the operand is in memory at address R0+x.
This is used for data that is stored near the code that uses it, when the compiler does not know at which absolute address the code will be located, but it knows that the data is, e.g., twenty words behind the instruction.
As=11, R0 (#x): This is exactly the same as @R0+, and is used for instructions that need a word of data from the instruction stream. For example, this assembler instruction:
MOV #1234, R5
is actually encoded and implemented as:
MOV @PC+, R5
.dw 1234
After the CPU has read the MOV instruction word, PC points to the data word. When reading the first MOV operand, the CPU reads the data word, and increments PC again.
As=01, Ad=1, R2 (&ADDR): this is exactly the same as x(Rn), but the R2 register reads as zero, so what you end up with is the value of x.
Using the register that reads as zero makes it possible to encode absolute addresses without needing a special addressing mode for this (just a special register).
constants -1/0/1/2/4/8: it would not make sense to use the SR and CG registers with most addressing modes, so these encodings are used to generate special values without a separate data word, to save space:
encoding:          what actually happens:
MOV @SR, R5        MOV #4, R5
MOV @SR+, R5       MOV #8, R5
MOV CG, R5         MOV #0, R5
MOV x(CG), R5      MOV #1, R5   (no word for x)
MOV @CG, R5        MOV #2, R5
MOV @CG+, R5       MOV #-1, R5
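In a simulator, the table above boils down to a small special-case check before normal operand decoding; a rough sketch (the function name and return convention are mine):
#include <stdint.h>

/* Returns 1 and sets *value if (reg, as) hits the constant generator,
 * otherwise returns 0 and the operand is decoded normally. */
static int constant_generator(unsigned reg, unsigned as, int32_t *value)
{
    if (reg == 2) {                    /* R2 / SR */
        if (as == 2) { *value = 4; return 1; }    /* As=10 -> #4 */
        if (as == 3) { *value = 8; return 1; }    /* As=11 -> #8 */
    } else if (reg == 3) {             /* R3 / CG */
        switch (as) {
        case 0: *value =  0; return 1; /* As=00 -> #0  */
        case 1: *value =  1; return 1; /* As=01 -> #1, no index word fetched */
        case 2: *value =  2; return 1; /* As=10 -> #2  */
        case 3: *value = -1; return 1; /* As=11 -> #-1 */
        }
    }
    return 0;                          /* R2 with As=00/01 and all other registers: decode normally */
}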

Decoding BLX instruction on ARM/Thumb(Android)

I want to decode a blx instruction on ARM, and I have found a good answer here:
Decoding BLX instruction on ARM/Thumb (IOS)
But in my case I followed those steps one by one and got the wrong result; can anyone tell me why?
This is my test:
.plt: 000083F0 sub_83F0 ...
...
.text:00008436 FF F7 DC EF BLX sub_83F0
I parse the machine code 'FF F7 DC EF' as follows:
F7 FF EF DC
11110 1 1111111111 11 1 0 1 1111101110 0
S imm10H J1 J2 imm10L
I1 = NOT(J1 EOR S) = 1
I2 = NOT(J2 EOR S) = 1
imm32 = SignExtend(S:I1:I2:imm10H:imm10L:00)
= SignExtend(1111111111111111110111000)
= SignExtend(0x1FFFFB8)
= ?
So the offset is 0xFFB8?
But 0x83F0-0X8436-4=0xFFB6
I need your help!!!
When the target of a BLX is 32-bit ARM code, the immediate value encoded in the BLX instruction is added to align(PC,4), not the raw value of PC.
PC during execution of the BLX instruction is 0x8436 + 4 == 0x843a due to the ARM pipeline
align(0x843a, 4) == 0x8438
So:
0x00008438 + 0xffffffb8 == 0x83f0
The ARM ARM mentions this in the assembler syntax for the <label> part of the instruction:
For BLX (encodings T2, A2), the assembler calculates the required value of the offset from the Align(PC,4) value of the BLX instruction to this label, then selects an encoding that sets imm32 to that offset.
The alignment requirement can also be found by careful reading of the Operation pseudocode in the ARM ARM:
if ConditionPassed() then
EncodingSpecificOperations();
if CurrentInstrSet == InstrSet_ARM then
next_instr_addr = PC - 4;
LR = next_instr_addr;
else
next_instr_addr = PC;
LR = next_instr_addr<31:1> : ‘1’;
if toARM then
SelectInstrSet(InstrSet_ARM);
BranchWritePC(Align(PC,4) + imm32); // <--- alignment of the current PC when BLX to non-Thumb ARM code
else
SelectInstrSet(InstrSet_Thumb);
BranchWritePC(PC + imm32);
F7FF
1111011111111111
111 10 11111111111 h = 10 offset upper = 11111111111
EFDC
1110111111011100
111 01 11111011100 h = 01 blx offset upper 11111011100
offset = 1111111111111111011100<<1
sign extended = 0xFFFFFFB8
0x00008436 + 2 + 0xFFFFFFB8 = 0x1000083F0
clip to 32 bits 0x000083F0
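To tie the pieces together, here is a rough C version of the T2 offset computation described above (field positions follow the bit layouts shown; the function name and arguments are mine). With hw1 = 0xF7FF, hw2 = 0xEFDC and pc = 0x8436 it returns 0x83F0, matching the disassembly:
#include <stdint.h>

/* Target of a 32-bit Thumb-2 BLX (encoding T2); pc is the address of the instruction. */
static uint32_t blx_t2_target(uint16_t hw1, uint16_t hw2, uint32_t pc)
{
    uint32_t S      = (hw1 >> 10) & 1;
    uint32_t imm10H =  hw1        & 0x3FF;
    uint32_t J1     = (hw2 >> 13) & 1;
    uint32_t J2     = (hw2 >> 11) & 1;
    uint32_t imm10L = (hw2 >>  1) & 0x3FF;
    uint32_t I1 = !(J1 ^ S);
    uint32_t I2 = !(J2 ^ S);
    uint32_t imm32 = (S << 24) | (I1 << 23) | (I2 << 22)
                   | (imm10H << 12) | (imm10L << 2);
    if (S)
        imm32 |= 0xFE000000u;                 /* sign-extend from bit 24 */
    return ((pc + 4) & ~3u) + imm32;          /* Align(PC,4) + imm32 */
}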
