A64 Neon SIMD - 256-bit comparison - arm

I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently.
Equality (=)
For equality, I already got a solution:
bool eq256(const UInt256 *lhs, const UInt256 *rhs) {
bool result;
First, load the two values into SIMD registers.
__asm__("ld1.2d { v0, v1 }, %1 \n\t"
"ld1.2d { v2, v3 }, %2 \n\t"
Compare each 64-bit limb of the values with each other. This results in -1 (all bits set) for those limbs that are equal, and 0 (all bits clear) if a bit differs.
"cmeq.2d v0, v0, v2 \n\t"
"cmeq.2d v1, v1, v3 \n\t"
Reduce the result from 2 vectors to 1 vector, keeping only the one that contains "0 (all bits clear)" if there is any.
"uminp.16b v0, v0, v1 \n\t"
Reduce the result from 1 vector to 1 byte, keeping only a byte with zeroes if there is any.
"uminv.16b b0, v0 \n\t"
Move to ARM register, then compare with 0xFF. This is the result.
"umov %w0, v0.b[0] \n\t"
"cmp %w0, 0xFF \n\t"
"cset %w0, eq "
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "cc");
return result;
}
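For comparison, the same check can be written with AArch64 NEON intrinsics and the compiler left to schedule the loads. This is only a sketch, assuming UInt256 stores its four little-endian 64-bit limbs in a value array as the inline asm above implies; it uses an element-wise vminq instead of the pairwise uminp, which is equivalent for the final reduction:
#include <arm_neon.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t value[4]; } UInt256;   // assumed layout

static inline bool eq256_intrinsics(const UInt256 *lhs, const UInt256 *rhs) {
    uint64x2_t l01 = vld1q_u64(&lhs->value[0]);
    uint64x2_t l23 = vld1q_u64(&lhs->value[2]);
    uint64x2_t r01 = vld1q_u64(&rhs->value[0]);
    uint64x2_t r23 = vld1q_u64(&rhs->value[2]);
    // Per-limb compare: all-ones where the limbs are equal, all-zeros otherwise.
    uint8x16_t e01 = vreinterpretq_u8_u64(vceqq_u64(l01, r01));
    uint8x16_t e23 = vreinterpretq_u8_u64(vceqq_u64(l23, r23));
    // Byte-wise min keeps a zero byte if any limb differed.
    uint8x16_t m = vminq_u8(e01, e23);
    // Horizontal min over 16 bytes: 0xFF only if every limb compared equal.
    return vminvq_u8(m) == 0xFF;
}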
Questions
Is this more efficient than doing the 4 comparisons with plain old ARM registers?
e.g. is there a source that quotes timings for the different operations? I'm doing this on iPhone 5s.
Is there a way to optimize this even further? I think that I waste many cycles just to reduce the whole vector to a single scalar boolean.
Less Than comparison (<)
Let's represent the two ints as tuples of 64-bit limbs (little-endian):
lhs = (l0, l1, l2, l3)
rhs = (r0, r1, r2, r3)
Then, lhs < rhs, if this evaluates to true:
(l3 < r3) & 1 & 1 & 1 |
(l3 = r3) & (l2 < r2) & 1 & 1 |
(l3 = r3) & (l2 = r2) & (l1 < r1) & 1 |
(l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
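For reference, the same condition as a plain scalar C sketch (assuming the limbs are held in a little-endian uint64_t[4], as above):
#include <stdbool.h>
#include <stdint.h>

// Purely illustrative scalar version of the expression above.
static inline bool lt256_ref(const uint64_t l[4], const uint64_t r[4]) {
    return  (l[3] <  r[3]) |
           ((l[3] == r[3]) & (l[2] <  r[2])) |
           ((l[3] == r[3]) & (l[2] == r[2]) & (l[1] <  r[1])) |
           ((l[3] == r[3]) & (l[2] == r[2]) & (l[1] == r[1]) & (l[0] < r[0]));
}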
SIMD instructions can now be used to evaluate multiple operands at a time. Assuming (l0, l1), (l2, l3), (r0, r1), (r2, r3) is the way the two 256-bit numbers end up in the vector registers, we can easily get all of the required values (everything except l0 = r0 is needed):
cmlo.2d => (l0 < r0), (l1 < r1)
cmlo.2d => (l2 < r2), (l3 < r3)
cmeq.2d => (l0 = r0), (l1 = r1)
cmeq.2d => (l2 = r2), (l3 = r3)
Questions
With these values in four SIMD registers, I now wonder what the best strategy is to apply the & and | operators, and then reduce the result to a single boolean.
Update
I just punched together a working implementation for "less than".
Basically, I replaced the 1s above with a duplicated condition, because A & A == A, which serves the same purpose as A & 1 for these all-ones/all-zeros masks.
Then, I lay out the three 2x2 squares in my matrix, and bitwise AND them.
Now, I reduce with bitwise ORs - first from two vectors to one vector, then to one byte, then copy to ARM register, and test for 0xFF. Same pattern as for equality above.
The questions above are still valid. I'm not sure the code is optimal yet, and wonder if I missed some general SIMD pattern for doing this kind of thing more efficiently. Also: is NEON worth it for such cases when the input operands come from memory?
bool lt256(const UInt256 *lhs, const UInt256 *rhs) {
bool result;
__asm__(// (l3 < r3) & (l3 < r3) |
// (l3 = r3) & (l2 < r2) |
// (l3 = r3) & (l2 = r2) & (l1 < r1) & (l1 < r1) |
// (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
"ld1.2d { v0, v1 }, %1 \n\t"
"ld1.2d { v2, v3 }, %2 \n\t"
// v0: [ l3 = r3 ] [ l2 = r2 ]
// v1: [ l0 < r0 ] [ l1 < r1 ]
// v2: [ l0 = r0 ] [ l1 = r1 ]
// v3: [ l2 < r2 ] [ l3 < r3 ]
// v4: [ l2 = r2 ] [ l3 = r3 ]
"cmeq.2d v4, v1, v3 \n\t"
"cmlo.2d v3, v1, v3 \n\t"
"cmlo.2d v1, v0, v2 \n\t"
"cmeq.2d v2, v0, v2 \n\t"
"ext.16b v0, v4, v4, 8 \n\t"
// v2: [ l1 < r1 ] [ l1 = r1 ]
// v1: [ l1 < r1 ] [ l0 < r0 ]
"trn2.2d v2, v1, v2 \n\t"
"ext.16b v1, v1, v1, 8 \n\t"
// v1: [ l1 < r1 & l1 < r1 ] [ l1 = r1 & l0 < r0 ]
"and.16b v1, v2, v1 \n\t"
// v2: [ l3 < r3 ] [ l3 = r3 ]
// v3: [ l3 < r3 ] [ l2 < r2 ]
"ext.16b v2, v3, v0, 8 \n\t"
"ext.16b v3, v3, v3, 8 \n\t"
// v3: [ l3 < r3 & l3 < r3 ] [ l3 = r3 & l2 < r2 ]
"and.16b v3, v2, v3 \n\t"
// v2: [ l3 = r3 ] [ l3 = r3 ]
// v4: [ l2 = r2 ] [ l2 = r2 ]
"ext.16b v2, v4, v0, 8 \n\t"
"ext.16b v4, v0, v4, 8 \n\t"
// v2: [ l3 = r3 & l2 = r2 ] [ l3 = r3 & l2 = r2 ]
"and.16b v2, v2, v4 \n\t"
// v1: [ l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
// [ l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"and.16b v1, v2, v1 \n\t"
// v1: [ l3 < r3 & l3 < r3 |
// l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
// [ l3 = r3 & l2 < r2 |
// l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"orr.16b v1, v3, v1 \n\t"
// b1: [ l3 < r3 & l3 < r3 |
// l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 |
// l3 = r3 & l2 < r2 |
// l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"umaxv.16b b1, v1 \n\t"
"umov %w0, v1.b[0] \n\t"
"cmp %w0, 0xFF \n\t"
"cset %w0, eq"
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "v4", "cc");
return result;
}

Benchmarked with XCTest measureMetrics using a Swift-based test runner. Two 256-bit ints are allocated. Then an operation is repeated 100 million times on the same two ints, measurement is stopped, and each limb of the two ints is assigned a new random value with arc4random. A second run is performed with Instruments attached, and the CPU time distribution is noted for each instruction as a comment right next to it.
Equality (==)
For equality, SIMD seems to lose when the result is transferred from the SIMD registers back to the ARM register. SIMD is probably only worth it when the result is used in further SIMD calculations, or if longer ints than 256-bit are used (ld1 seems to be faster than ldp).
SIMD
bool result;
__asm__("ld1.2d { v0, v1 }, %1 \n\t" // 5.1%
"ld1.2d { v2, v3 }, %2 \n\t" // 26.4%
"cmeq.2d v0, v0, v2 \n\t"
"cmeq.2d v1, v1, v3 \n\t"
"uminp.16b v0, v0, v1 \n\t" // 4.0%
"uminv.16b b0, v0 \n\t" // 26.7%
"umov %w0, v0.b[0] \n\t" // 32.9%
"cmp %w0, 0xFF \n\t" // 0.0%
"cset %w0, eq "
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "cc");
return result; // 4.9% ("ret")
measured [Time, seconds] average: 11.558, relative standard deviation: 0.065%, values: [11.572626, 11.560558, 11.549322, 11.568718, 11.558530, 11.550490, 11.557086, 11.551803, 11.557529, 11.549782]
Standard
The winner here. The ccmp instruction really shines here :-)
It is clear, though, that the problem is memory bound.
bool result;
__asm__("ldp x8, x9, %1 \n\t" // 33.4%
"ldp x10, x11, %2 \n\t"
"cmp x8, x10 \n\t"
"ccmp x9, x11, 0, eq \n\t"
"ldp x8, x9, %1, 16 \n\t" // 34.1%
"ldp x10, x11, %2, 16 \n\t"
"ccmp x8, x10, 0, eq \n\t" // 32.6%
"ccmp x9, x11, 0, eq \n\t"
"cset %w0, eq \n\t"
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "x8", "x9", "x10", "x11", "cc");
return result;
measured [Time, seconds] average: 11.146, relative standard deviation: 0.034%, values: [11.149754, 11.142854, 11.146840, 11.149392, 11.141254, 11.148708, 11.142293, 11.150491, 11.139593, 11.145873]
C
LLVM fails to detect that "ccmp" is a good instruction to use here, and is slower than the asm version above.
return
lhs->value[0] == rhs->value[0] &
lhs->value[1] == rhs->value[1] &
lhs->value[2] == rhs->value[2] &
lhs->value[3] == rhs->value[3];
Compiled to
ldp x8, x9, [x0] // 24.1%
ldp x10, x11, [x1] // 0.1%
cmp x8, x10 // 0.4%
cset w8, eq // 1.0%
cmp x9, x11 // 23.7%
cset w9, eq
and w8, w8, w9 // 0.1%
ldp x9, x10, [x0, #16]
ldp x11, x12, [x1, #16] // 24.8%
cmp x9, x11
cset w9, eq // 0.2%
and w8, w8, w9
cmp x10, x12 // 0.3%
cset w9, eq // 25.2%
and w0, w8, w9
ret // 0.1%
measured [Time, seconds] average: 11.531, relative standard deviation: 0.040%, values: [11.525511, 11.529820, 11.541940, 11.531776, 11.533287, 11.526628, 11.531392, 11.526037, 11.531784, 11.533786]
Less Than (<)
(to be determined - will update later)

Since the simple scalar ccmp implementation was the winner for the equality test, here's an equally simple scalar solution for less-than.
The approach for less-than above was based on lexicographic comparison, starting with the most significant limbs. I didn't see a good way to do that with ccmp. The problem is that in a branchless lexicographic compare, there are three possible states at each step: a previous limb compared less, a previous limb compared greater, all previous limbs compared equal. ccmp can't really keep track of three states. We could do it if ccmp's behavior when its condition is false were "do nothing", as with ARM32 conditionals, instead of "load flags with immediate".
So instead, here's an even more basic approach: do a multi-precision subtract, and check the carry flag at the end.
inline bool lt256(const uint64_t *a, const uint64_t *b) {
const int limbs = 4; // number of 64-bit chunks in a full number
uint64_t a0,a1,b0,b1; // for scratch registers
bool ret;
asm(R"(
ldp %[a0], %[a1], [%[a]]
ldp %[b0], %[b1], [%[b]]
subs xzr, %[a0], %[b0]
sbcs xzr, %[a1], %[b1]
ldp %[a0], %[a1], [%[a], #16]
ldp %[b0], %[b1], [%[b], #16]
sbcs xzr, %[a0], %[b0]
sbcs xzr, %[a1], %[b1]
)"
: "=#cclo" (ret),
[a0] "=&r" (a0), [a1] "=&r" (a1), [b0] "=&r" (b0), [b1] "=&r" (b1)
: [a] "r" (a), [b] "r" (b),
"m" (*(const uint64_t (*)[limbs])a),
"m" (*(const uint64_t (*)[limbs])b)
);
return ret;
}
I chose to use a flag output operand for the result instead of explicitly writing a boolean to a register. (This feature didn't exist in a stable GCC release when the previous answer was written, and is still not supported by Clang for AArch64.) This could save an instruction, and a register, if the result will be branched on.
I also chose to do the loads within the asm. We could also use eight input operands and have the compiler do the loads, but then we would need eight registers instead of 4-6 as it stands. Worth trying if there is reason to think the limbs are already in general-purpose registers. Alternatively, you could reduce the register usage further by loading one pair of limbs at a time, instead of two, but at the cost of larger and probably slower code.
The zero register provides a convenient way to discard the numerical results of the subtractions, since we don't need them.
Performance should be pretty similar to the ccmp-based eq256, as both are essentially four subtracts in a dependency chain. Taking Cortex-A72 as an example, cmp/ccmp and subs/sbcs are all single-uop instructions that can execute on either of the two integer pipelines. The documentation doesn't say whether the flags are renamed, but if they are, you should be able to write two of these chains in series and have them execute in parallel.
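For reference, a portable C sketch of the same borrow-propagation idea (not tuned, just to show what the subs/sbcs chain computes; the final borrow corresponds to the "lo" condition):
#include <stdbool.h>
#include <stdint.h>

static inline bool lt256_portable(const uint64_t a[4], const uint64_t b[4]) {
    bool borrow = false;
    for (int i = 0; i < 4; ++i) {
        // Propagate the borrow from least to most significant limb,
        // exactly like the subs/sbcs chain does.
        borrow = (a[i] < b[i]) || (a[i] == b[i] && borrow);
    }
    return borrow;   // borrow set <=> a < b
}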

Related

ARM NEON: Convert a binary 8-bit-per-pixel image (only 0/1) to 1-bit-per-pixel?

I am working on a task to convert a large binary label image, which has 8 bits (uint8_t) per pixel and each pixel can only be 0 or 1 (or 255), to an array of uint64_t numbers and each bit in uint64_t number represent a label pixel.
For example,
input array: 0 1 1 0 ... (00000000 00000001 00000001 00000000 ...)
or input array: 0 255 255 0 ... (00000000 11111111 11111111 00000000 ...)
output array (number): 6 (because after converting each uint8_t to a bit, it becomes 0110)
Currently the C code to achieve this is:
for (int j = 0; j < width >> 6; j++) {
uint8_t* in_ptr= in + (j << 6);
uint64_t out_bits = 0;
if (in_ptr[0]) out_bits |= 0x0000000000000001;
if (in_ptr[1]) out_bits |= 0x0000000000000002;
.
.
.
if (in_ptr[63]) out_bits |= 0x8000000000000000;
*output = out_bits; output++;
}
Can ARM NEON optimize this functionality? Please help. Thank you!
Assuming the input value is either 0 or 255, below is the basic version which is rather straightforward, especially for people with Intel SSE/AVX experience.
void foo_basic(uint8_t *pDst, uint8_t *pSrc, intptr_t length)
{
//assert(length >= 64);
//assert(length & 7 == 0);
uint8x16_t in0, in1, in2, in3;
uint8x8_t out;
const uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};
length -= 64;
do {
do {
in0 = vld1q_u8(pSrc); pSrc += 16;
in1 = vld1q_u8(pSrc); pSrc += 16;
in2 = vld1q_u8(pSrc); pSrc += 16;
in3 = vld1q_u8(pSrc); pSrc += 16;
in0 &= mask;
in1 &= mask;
in2 &= mask;
in3 &= mask;
in0 = vpaddq_u8(in0, in1);
in2 = vpaddq_u8(in2, in3);
in0 = vpaddq_u8(in0, in2);
out = vpadd_u8(vget_low_u8(in0), vget_high_u8(in0));
vst1_u8(pDst, out); pDst += 8;
length -= 64;
} while (length >=0);
pSrc += length>>3;
pDst += length;
} while (length > -64);
}
Neon, however, has VERY user-friendly and efficient permutation and bit-operation instructions that allow going "vertical":
void foo_advanced(uint8_t *pDst, uint8_t *pSrc, intptr_t length)
{
//assert(length >= 128);
//assert(length & 7 == 0);
uint8x16x4_t in0, in1;
uint8x16x2_t row04, row15, row26, row37;
length -= 128;
do {
do {
in0 = vld4q_u8(pSrc); pSrc += 64;
in1 = vld4q_u8(pSrc); pSrc += 64;
row04 = vuzpq_u8(in0.val[0], in1.val[0]);
row15 = vuzpq_u8(in0.val[1], in1.val[1]);
row26 = vuzpq_u8(in0.val[2], in1.val[2]);
row37 = vuzpq_u8(in0.val[3], in1.val[3]);
row04.val[0] = vsliq_n_u8(row04.val[0], row15.val[0], 1);
row26.val[0] = vsliq_n_u8(row26.val[0], row37.val[0], 1);
row04.val[1] = vsliq_n_u8(row04.val[1], row15.val[1], 1);
row26.val[1] = vsliq_n_u8(row26.val[1], row37.val[1], 1);
row04.val[0] = vsliq_n_u8(row04.val[0], row26.val[0], 2);
row04.val[1] = vsliq_n_u8(row04.val[1], row26.val[1], 2);
row04.val[0] = vsliq_n_u8(row04.val[0], row04.val[1], 4);
vst1q_u8(pDst, row04.val[0]); pDst += 16;
length -= 128;
} while (length >=0);
pSrc += length>>3;
pDst += length;
} while (length > -128);
}
The Neon-only advanced version is shorter and faster, but GCC is extremely bad at dealing with Neon specific permutation instructions such as vtrn, vzip, and vuzp.
https://godbolt.org/z/bGdbohqKe
Clang isn't any better: it spams unnecessary vorr where GCC does the same with vmov.
.syntax unified
.arm
.arch armv7-a
.fpu neon
.global foo_asm
.text
.func
.balign 64
foo_asm:
sub r2, r2, #128
.balign 16
1:
vld4.8 {d16, d18, d20, d22}, [r1]!
vld4.8 {d17, d19, d21, d23}, [r1]!
vld4.8 {d24, d26, d28, d30}, [r1]!
vld4.8 {d25, d27, d29, d31}, [r1]!
subs r2, r2, #128
vuzp.8 q8, q12
vuzp.8 q9, q13
vuzp.8 q10, q14
vuzp.8 q11, q15
vsli.8 q8, q9, #1
vsli.8 q10, q11, #1
vsli.8 q12, q13, #1
vsli.8 q14, q15, #1
vsli.8 q8, q10, #2
vsli.8 q12, q14, #2
vsli.8 q8, q12, #4
vst1.8 {q8}, [r0]!
bpl 1b
add r1, r1, r2
cmp r2, #-128
add r0, r0, r2, asr #3
bgt 1b
.balign 8
bx lr
.endfunc
.end
The innermost loop consists of:
GCC: 32 instructions
Clang: 30 instructions
Asm: 18 instructions
It doesn't take rocket science to figure out which one is the fastest and by how much: Never trust compilers if you are about to do permutations.
Standing on the shoulders of Jake 'Alquimista' LEE, we can improve the unzipping and the algorithm as well by changing the order of the unzip and vsli operations:
#define interleave_nibbles(top) \
top.val[0] = vsliq_n_u8(top.val[0], top.val[1],1);\
top.val[2] = vsliq_n_u8(top.val[2], top.val[3],1);\
top.val[0] = vsliq_n_u8(top.val[0], top.val[2],2);
void transpose_bits(uint8_t const *src, uint8_t *dst) {
uint8x16x4_t top = vld4q_u8(src);
uint8x16x4_t bot = vld4q_u8(src + 64); src+=128;
interleave_nibbles(top);
interleave_nibbles(bot);
// now we have 4 bits correct in each of the 32 bytes left
// top = 0to3 4to7 8to11 12to15 ...
// bot = 64to67 68to71 ...
uint8x16x2_t top_bot = vuzpq_u8(top.val[0], bot.val[0]);
uint8x16_t result = vsliq_n_u8(top_bot.val[0], top_bot.val[1], 4);
vst1q_u8(dst, result); dst += 16;
}
The assembly produced by clang now has only two extraneous movs (done via vorr), and the GCC output has four movs.
vld4.8 {d16, d18, d20, d22}, [r0]!
vld4.8 {d17, d19, d21, d23}, [r0]!
vld4.8 {d24, d26, d28, d30}, [r0]!
vsli.8 q10, q11, #1
vorr q0, q8, q8
vld4.8 {d25, d27, d29, d31}, [r0]
vsli.8 q0, q9, #1
vorr q2, q14, q14
vsli.8 q12, q13, #1
vsli.8 q2, q15, #1
vsli.8 q0, q10, #2
vsli.8 q12, q2, #2
vuzp.8 q0, q12
vsli.8 q0, q12, #4
vst1.8 {d0, d1}, [r1]
And the arm64 version looks perfect with only 12 instructions.
ld4 { v0.16b, v1.16b, v2.16b, v3.16b }, [x0], #64
ld4 { v4.16b, v5.16b, v6.16b, v7.16b }, [x0]
sli v0.16b, v1.16b, #1
sli v2.16b, v3.16b, #1
sli v0.16b, v2.16b, #2
sli v4.16b, v5.16b, #1
sli v6.16b, v7.16b, #1
sli v4.16b, v6.16b, #2
uzp1 v16.16b, v0.16b, v4.16b
uzp2 v0.16b, v0.16b, v4.16b
sli v16.16b, v0.16b, #4
str q16, [x1]
You can do it more efficiently (especially for short arrays or single vectors) using something like this (in this example, turning one 128 bit register into one 16 bit mask):
// turn mask of bytes in v0 into mask of bits in w0
movmsk: adr x0, 0f // obtain address of literal
ld1r {v1.2d}, [x0] // load 80..01 mask twice into v1
and v0.16b, v0.16b, v1.16b // mask bytes from ff to single bits
mov d1, v0.d[1] // extract high 64 bit
zip1 v0.8b, v0.8b, v1.8b // interleave high and low bytes
addv h0, v0.8h // sum into bit mask
mov w0, v0.s[0] // move result to general register
ret
0: .quad 0x8040201008040201
The idea is to turn the contents of each byte into just one bit at the bit position it's going to end up at and to then sum up the bits using addv (8 bytes at a time, resulting in one byte of output).
Putting a loop around this code to have it traverse the entire array is left as an exercise to the reader.
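As a sketch of the same idea with intrinsics (assuming, as above, that the input bytes are either 0x00 or 0xFF): each byte is masked down to the single bit it will occupy, and a horizontal add collapses 8 bytes into one mask byte. This variant combines the two halves with shifts rather than the zip1/addv trick in the asm, but produces the same 16-bit mask:
#include <arm_neon.h>
#include <stdint.h>

static inline uint16_t neon_movemask_u8(uint8x16_t v) {
    const uint8x16_t bits =
        vreinterpretq_u8_u64(vdupq_n_u64(0x8040201008040201ULL));
    uint8x16_t masked = vandq_u8(v, bits);        // keep one bit per byte
    uint8_t lo = vaddv_u8(vget_low_u8(masked));   // sum low 8 bytes -> low mask byte
    uint8_t hi = vaddv_u8(vget_high_u8(masked));  // sum high 8 bytes -> high mask byte
    return (uint16_t)lo | ((uint16_t)hi << 8);
}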

XC16 Disassembly for (uint32) & (uint32) operation

I am in the process of trying to figure out how many cycles some uint32 operations will take on a 16bit dsPIC. I started with bitwise AND and wrote the following program:
int main(void) {
unsigned long var1, var2, var3;
var1 = 80000ul;
var2 = 190000ul;
while (1) {
var3 = var1 & var2;
}
var1 = 0;
return 0;
}
Looking at the disassembly to see what the compiler came up with, I got the following:
! var3 = var1 & var2;
0x2DE: MOV [W14+4], W0
0x2E0: MOV [W14+6], W1
0x2E2: MOV.D [W14], W2
0x2E4: MOV W2, W4
0x2E6: MOV W3, W2
0x2E8: MOV W0, W3
0x2EA: MOV W1, W0
0x2EC: AND W4, W3, W4
0x2EE: AND W2, W0, W0
0x2F0: CLR W1
0x2F2: SL W0, #0, W1
0x2F4: MOV #0x0, W0
0x2F6: MOV.D W0, W2
0x2F8: MUL.UU W4, #1, W0
0x2FA: IOR W2, W0, W2
0x2FC: IOR W3, W1, W3
0x2FE: MOV W2, [W14+8]
0x300: MOV W3, [W14+10]
20 cycles, 6 I/O moves and 14 core. This looks bonkers to me. Couldn't it just do this?
MOV.D [W14+4], W0
MOV.D [W14], W2
AND W0, W2, W0
AND W1, W3, W1
MOV.D W0, [W14+8]
That drops the core cycles to 2, which makes logical sense to me at least (two 16-bit-wide ANDs). What is the compiler up to that I don't understand?

Is there a trick to make GCC optimize away redundant instructions?

Compiling with gcc -mcpu=cortex-m0 -mthumb -Os
emits redundant instructions like in this illustrative example:
void memzero(void* p, int n)
{
n -= 4;
do
{
*(int*)((char*)p + n) = 0;
n -= 4;
}
while(n > 0);
}
Results in:
memzero:
movs r3, #0
subs r1, r1, #4
.L2:
str r3, [r0, r1]
subs r1, r1, #4
cmp r1, #0
bgt .L2
bx lr
Obviously, the explicit compare is essentially a nop. Is there some way to turn on more optimization to fix this?
Removing the compare would change the behavior of the function.
The BGT instruction jumps if Z == 0 and N == V. This is important when n overflows.
Consider calling the function with n = -2147483644 (if int is 32 bit):
memzero:
movs r3, #0
subs r1, r1, #4 ; n = -2147483648
.L2:
str r3, [r0, r1]
subs r1, r1, #4 ; n = 2147483644, Z = 0, N = 0, V = 1
;cmp r1, #0 ; (would set Z = 0, N = 0, V = 0)
bgt .L2 ; doesn't jump, even though n is positive
bx lr
The optimization works if we test for n >= 0 because there is an instruction that jumps if N == 0:
memzero:
movs r3, #0
subs r1, r1, #4
.L2:
str r3, [r0, r1]
subs r1, r1, #4
bpl .L2
bx lr
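For reference, a sketch of the source-level change that corresponds to the bpl version (whether a particular GCC version emits exactly this is not guaranteed, and note that n >= 0 is a different termination condition than the original n > 0, so the function clears one extra word for some inputs):
void memzero(void* p, int n)
{
    n -= 4;
    do
    {
        *(int*)((char*)p + n) = 0;
        n -= 4;
    }
    while (n >= 0);   /* testable from the N flag alone (bpl) */
}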
Test program
#include <stdio.h>
#include <limits.h>
__attribute__((noinline)) int with_cmp(int n) {
asm("L1:\n\t"
"subs %[n], #4\n\t"
"cmp %[n], #0\n\t"
"bgt L1"
: [n] "+r" (n));
return n;
}
__attribute__((noinline)) int without_cmp(int n) {
asm("L2:\n\t"
"subs %[n], #4\n\t"
"bgt L2"
: [n] "+r" (n));
return n;
}
int main() {
printf("with cmp: %d\nwithout cmp: %d\n", with_cmp(INT_MIN), without_cmp(INT_MIN));
}
Output:
with cmp: 0 // loops as long as n > 0
without cmp: 2147483644 // immediately returns with positive n
For each value of n going into the subs, these are the flags after the subs and after the cmp:
n      subs    cmp
+6     nzCv    nzCv
+5     nzCv    nzCv
+4     nZCv    nZCv
+3     Nzcv    NzCv
+2     Nzcv    NzCv
+1     Nzcv    NzCv
+0     Nzcv    NzCv
-1     NzCv    NzCv
-2     NzCv    NzCv
-3     NzCv    NzCv
The code is meant to stop when n <= 0 after the subtraction; n = 4 is the boundary case (4 - 4 = 0).
+4 nZCv nZCv
For this boundary row, signed BGT (Z == 0 and N == V) is a valid choice, and it works the same for both the subs and the cmp. This is a missed peephole optimization which you are free to investigate or report. I can't imagine this has not been reported to date, unless it was recently added unintentionally.
+3 Nzcv NzCv
This row is what is mentioned in another answer: if the test is changed to n >= 0, then the N flag alone determines the boundary.
If you try this with an incrementing loop you can still run into it. I believe we have seen this asked on Stack Overflow before, but I don't know what to search for; perhaps in that case it didn't appear to be a possible optimization.

Is it possible to check if any of 2 sets of 3 ints is equal with less than 9 comparisons?

int eq3(int a, int b, int c, int d, int e, int f){
return a == d || a == e || a == f
|| b == d || b == e || b == f
|| c == d || c == e || c == f;
}
This function receives 6 ints and returns true if any of the 3 first ints is equal to any of the 3 last ints. Is there any bitwise-hack similar way to make it faster?
Assuming you're expecting a high rate of false results you could make a quick "pre-check" to quickly isolate such cases:
If a bit in a is set that isn't set in any of d, e and f then a cannot be equal to any of these.
Thus something like
int pre_eq3(int a, int b, int c, int d, int e, int f){
int const mask = ~(d | e | f);
if ((a & mask) && (b & mask) && (c & mask)) {
return false;
}
return eq3(a, b, c, d, e, f);
}
could speed it up (8 operations instead of 17, but much more costly if the result will actually be true). If mask == 0 then of course this won't help.
This can be further improved if with high probability a & b & c has some bits set:
int pre_eq3(int a, int b, int c, int d, int e, int f){
int const mask = ~(d | e | f);
if ((a & b & c) & mask) {
return false;
}
if ((a & mask) && (b & mask) && (c & mask)) {
return false;
}
return eq3(a, b, c, d, e, f);
}
Now if all of a, b and c have bits set where none of d, e and f have any bits set, we're out pretty fast.
Expanding on dawg's SSE comparison method, you can combine the results of the comparisons using a vector OR, and move a mask of the compare results back to an integer to test for 0 / non-zero.
Also, you can get data into vectors more efficiently (although it's still pretty clunky to get many separate integers into vectors when they're live in registers to start with, rather than sitting in memory).
You should avoid store-forwarding stalls that result from doing three small stores and one big load.
///// UNTESTED ////////
#include <immintrin.h>
int eq3(int a, int b, int c, int d, int e, int f){
// Use _mm_set to let the compiler worry about getting integers into vectors
// Use -mtune=intel or gcc will make bad code, though :(
__m128i abcc = _mm_set_epi32(0,c,b,a); // args go from high to low position in the vector
// masking off the high bits of the result-mask to avoid false positives
// is cheaper than repeating c (to do the same compare twice)
__m128i dddd = _mm_set1_epi32(d);
__m128i eeee = _mm_set1_epi32(e);
dddd = _mm_cmpeq_epi32(dddd, abcc);
eeee = _mm_cmpeq_epi32(eeee, abcc); // per element: 0(unequal) or -1(equal)
__m128i combined = _mm_or_si128(dddd, eeee);
__m128i ffff = _mm_set1_epi32(f);
ffff = _mm_cmpeq_epi32(ffff, abcc);
combined = _mm_or_si128(combined, ffff);
// results of all the compares are ORed together. All zero only if there were no hits
unsigned equal_mask = _mm_movemask_epi8(combined);
equal_mask &= 0x0fff; // the high 32b element could have false positives
return equal_mask;
// return !!equal_mask if you want to force it to 0 or 1
// the mask tells you whether it was a, b, or c that had a hit
// movmskps would return a mask of just 4 bits, one for each 32b element, but might have a bypass delay on Nehalem.
// actually, pmovmskb apparently runs in the float domain on Nehalem anyway, according to Agner Fog's table >.<
}
This compiles to pretty nice asm, pretty similar between clang and gcc, but clang's -fverbose-asm puts nice comments on the shuffles. Only 19 instructions including the ret, with a decent amount of parallelism from separate dependency chains. With -msse4.1, or -mavx, it saves another couple of instructions. (But probably doesn't run any faster)
With clang, dawg's version is about twice the size. With gcc, something bad happens and it's horrible (over 80 instructions. Looks like a gcc optimization bug, since it looks worse than just a straightforward translation of the source). Even clang's version spends so long getting data into / out of vector regs that it might be faster to just do the comparisons branchlessly and OR the truth values together.
This compiles to decent code:
// 8bit variable doesn't help gcc avoid partial-register stalls even with -mtune=core2 :/
int eq3_scalar(int a, int b, int c, int d, int e, int f){
char retval = (a == d) | (a == e) | (a == f)
| (b == d) | (b == e) | (b == f)
| (c == d) | (c == e) | (c == f);
return retval;
}
Play around with how to get the data from the caller into vector regs.
If the groups of three are coming from memory, then prob. passing pointers so a vector load can get them from their original location is best. Going through integer registers on the way to vectors sucks (higher latency, more uops), but if your data is already live in regs it's a loss to do integer stores and then vector loads. gcc is dumb and follows the AMD optimization guide's recommendation to bounce through memory, even though Agner Fog says he's found that's not worth it even on AMD CPUs. It's definitely worse on Intel, and apparently a wash or maybe still worse on AMD, so it's definitely the wrong choice for -mtune=generic. Anyway...
It's also possible to do 8 of our 9 compares with just two packed-vector compares.
The 9th can be done with an integer compare, and have its truth value ORed with the vector result. On some CPUs (esp. AMD, and maybe Intel Haswell and later) not transferring one of the 6 integers to vector regs at all might be a win. Mixing three integer branchless-compares in with the vector shuffles / compares would interleave them nicely.
These vector comparisons can be set up by using shufps on integer data (since it can combine data from two source registers). That's fine on most CPUs, but requires a lot of annoying casting when using intrinsics instead of actual asm. Even if there is a bypass delay, it's not a bad tradeoff vs. something like punpckldq and then pshufd.
aabb ccab
==== ====
dede deff
c==f
with asm something like:
#### untested
# pretend a is in eax, and so on
movd xmm0, eax
movd xmm1, ebx
movd xmm2, ecx
shl rdx, 32
#mov edi, edi # zero the upper 32 of rdi if needed, or use shld instead of OR if you don't care about AMD CPUs
or rdx, rdi # de in an integer register.
movq xmm3, rdx # de, aka (d<<32)|e
# in 32bit code, use a vector shuffle of some sort to do this in a vector reg, or:
#pinsrd xmm3, edi, 1 # SSE4.1, and 2 uops (same as movd+shuffle)
#movd xmm4, edi # e
movd xmm5, esi # f
shufps xmm0, xmm1, 0 # xmm0=aabb (low dword = a; my notation is backwards from left/right vector-shift perspective)
shufps xmm5, xmm3, 0b01000000 # xmm5 = ffde
punpcklqdq xmm3, xmm3 # broadcast: xmm3=dede
pcmpeqd xmm3, xmm0 # xmm3: aabb == dede
# spread these instructions out between vector instructions, if you aren't branching
xor edx,edx
cmp esi, ecx # c == f
#je .found_match # if there's one of the 9 that's true more often, make it this one. Branch mispredicts suck, though
sete dl
shufps xmm0, xmm2, 0b00001000 # xmm0 = abcc
pcmpeqd xmm0, xmm5 # abcc == ffde
por xmm0, xmm3
pmovmskb eax, xmm0 # will have bits set if cmpeq found any equal elements
or eax, edx # combine vector and scalar compares
jnz .found_match
# or record the result instead of branching on it
setnz dl
This is also 19 instructions (not counting the final jcc / setcc), but one of them is an xor-zeroing idiom, and there are other simple integer instructions. (Shorter encoding, some can run on port6 on Haswell+ which can't handle vector instructions). There might be a longer dep chain due to the chain of shuffles that builds abcc.
If you want a bitwise version look to xor. If you xor two numbers that are the same the answer will be 0. Otherwise, the bits will flip if one is set and the other is not. For example 1000 xor 0100 is 1100.
The code you have will likely cause at least 1 pipeline flush but apart from that it will be ok performance wise.
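To make the xor observation concrete, here is a tiny sketch with the same truth table as eq3 (each a ^ d is zero exactly when a == d); it is not necessarily any faster, since the && chain may still branch:
static inline int eq3_xor(int a, int b, int c, int d, int e, int f) {
    return !((a ^ d) && (a ^ e) && (a ^ f) &&
             (b ^ d) && (b ^ e) && (b ^ f) &&
             (c ^ d) && (c ^ e) && (c ^ f));
}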
I think using SSE is probably worth investigating.
It has been 20 years since I wrote any, and not benchmarked, but something like:
#include <xmmintrin.h>
int cmp3(int32_t a, int32_t b, int32_t c, int32_t d, int32_t e, int32_t f){
// returns -1 if any of a,b,c is eq to any of d,e,f
// returns 0 if all a,b,c != d,e,f
int32_t __attribute__ ((aligned(16))) vec1[4];
int32_t __attribute__ ((aligned(16))) vec2[4];
int32_t __attribute__ ((aligned(16))) vec3[4];
int32_t __attribute__ ((aligned(16))) vec4[4];
int32_t __attribute__ ((aligned(16))) r1[4];
int32_t __attribute__ ((aligned(16))) r2[4];
int32_t __attribute__ ((aligned(16))) r3[4];
// fourth word is DNK
vec1[0]=a;
vec1[1]=b;
vec1[2]=c;
vec2[0]=vec2[1]=vec2[2]=d;
vec3[0]=vec3[1]=vec3[2]=e;
vec4[0]=vec4[1]=vec4[2]=f;
__m128i v1 = _mm_load_si128((__m128i *)vec1);
__m128i v2 = _mm_load_si128((__m128i *)vec2);
__m128i v3 = _mm_load_si128((__m128i *)vec3);
__m128i v4 = _mm_load_si128((__m128i *)vec4);
// any(a,b,c) == d?
__m128i vcmp1 = _mm_cmpeq_epi32(v1, v2);
// any(a,b,c) == e?
__m128i vcmp2 = _mm_cmpeq_epi32(v1, v3);
// any(a,b,c) == f?
__m128i vcmp3 = _mm_cmpeq_epi32(v1, v4);
_mm_store_si128((__m128i *)r1, vcmp1);
_mm_store_si128((__m128i *)r2, vcmp2);
_mm_store_si128((__m128i *)r3, vcmp3);
// bit or the first three of each result.
// might be better with SSE mask, but I don't remember how!
return r1[0] | r1[1] | r1[2] |
r2[0] | r2[1] | r2[2] |
r3[0] | r3[1] | r3[2];
}
If done correctly, SSE with no branches should be 4x to 8x faster.
If your compiler/architecture supports vector extensions (like clang and gcc) you can use something like:
#ifdef __SSE2__
#include <immintrin.h>
#elif defined __ARM_NEON
#include <arm_neon.h>
#elif defined __ALTIVEC__
#include <altivec.h>
//#elif ... TODO more architectures
#endif
static int hastrue128(void *x){
#ifdef __SSE2__
return _mm_movemask_epi8(*(__m128i*)x);
#elif defined __ARM_NEON
return vaddlvq_u8(*(uint8x16_t*)x);
#elif defined __ALTIVEC__
typedef __UINT32_TYPE__ v4si __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
return vec_any_ne(*(v4si*)x,(v4si){0});
#else
int *y = x;
return y[0]|y[1]|y[2]|y[3];
#endif
}
//if inputs will always be aligned to 16 add an aligned attribute
//otherwise ensure they are at least aligned to 4
int cmp3( int* a , int* b ){
typedef __INT32_TYPE__ i32x4 __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
i32x4 x = *(i32x4*)a, cmp, tmp, y0 = y0^y0, y1 = y0, y2 = y0;
//start vectors off at 0 and add the int to each element for optimization
//it adds the int to each element, but since we started it at zero,
//a good compiler (not ICC at -O3) will skip the xor and add and just broadcast/whatever
y0 += b[0];
y1 += b[1];
y2 += b[2];
cmp = x == y0;
tmp = x == y1; //ppc complains if we don't use temps here
cmp |= tmp;
tmp = x ==y2;
cmp |= tmp;
//now hack off then end since we only need 3
cmp &= (i32x4){0xffffffff,0xffffffff,0xffffffff,0};
return hastrue128(&cmp);
}
int cmp4( int* a , int* b ){
typedef __INT32_TYPE__ i32x4 __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
i32x4 x = *(i32x4*)a, cmp, tmp, y0 = y0^y0, y1 = y0, y2 = y0, y3 = y0;
y0 += b[0];
y1 += b[1];
y2 += b[2];
y3 += b[3];
cmp = x == y0;
tmp = x == y1; //ppc complains if we don't use temps here
cmp |= tmp;
tmp = x ==y2;
cmp |= tmp;
tmp = x ==y3;
cmp |= tmp;
return hastrue128(&cmp);
}
On arm64 this compiles to the following branchless code:
cmp3:
ldr q2, [x0]
adrp x2, .LC0
ld1r {v1.4s}, [x1]
ldp w0, w1, [x1, 4]
dup v0.4s, w0
cmeq v1.4s, v2.4s, v1.4s
dup v3.4s, w1
ldr q4, [x2, #:lo12:.LC0]
cmeq v0.4s, v2.4s, v0.4s
cmeq v2.4s, v2.4s, v3.4s
orr v0.16b, v1.16b, v0.16b
orr v0.16b, v0.16b, v2.16b
and v0.16b, v0.16b, v4.16b
uaddlv h0,v0.16b
umov w0, v0.h[0]
uxth w0, w0
ret
cmp4:
ldr q2, [x0]
ldp w2, w0, [x1, 4]
dup v0.4s, w2
ld1r {v1.4s}, [x1]
dup v3.4s, w0
ldr w1, [x1, 12]
dup v4.4s, w1
cmeq v1.4s, v2.4s, v1.4s
cmeq v0.4s, v2.4s, v0.4s
cmeq v3.4s, v2.4s, v3.4s
cmeq v2.4s, v2.4s, v4.4s
orr v0.16b, v1.16b, v0.16b
orr v0.16b, v0.16b, v3.16b
orr v0.16b, v0.16b, v2.16b
uaddlv h0,v0.16b
umov w0, v0.h[0]
uxth w0, w0
ret
And on ICC x86_64 -march=skylake it produces the following branchless code:
cmp3:
vmovdqu xmm2, XMMWORD PTR [rdi] #27.24
vpbroadcastd xmm0, DWORD PTR [rsi] #34.17
vpbroadcastd xmm1, DWORD PTR [4+rsi] #35.17
vpcmpeqd xmm5, xmm2, xmm0 #34.17
vpbroadcastd xmm3, DWORD PTR [8+rsi] #37.16
vpcmpeqd xmm4, xmm2, xmm1 #35.17
vpcmpeqd xmm6, xmm2, xmm3 #37.16
vpor xmm7, xmm4, xmm5 #36.5
vpor xmm8, xmm6, xmm7 #38.5
vpand xmm9, xmm8, XMMWORD PTR __$U0.0.0.2[rip] #40.5
vpmovmskb eax, xmm9 #11.12
ret #41.12
cmp4:
vmovdqu xmm3, XMMWORD PTR [rdi] #46.24
vpbroadcastd xmm0, DWORD PTR [rsi] #51.17
vpbroadcastd xmm1, DWORD PTR [4+rsi] #52.17
vpcmpeqd xmm6, xmm3, xmm0 #51.17
vpbroadcastd xmm2, DWORD PTR [8+rsi] #54.16
vpcmpeqd xmm5, xmm3, xmm1 #52.17
vpbroadcastd xmm4, DWORD PTR [12+rsi] #56.16
vpcmpeqd xmm7, xmm3, xmm2 #54.16
vpor xmm8, xmm5, xmm6 #53.5
vpcmpeqd xmm9, xmm3, xmm4 #56.16
vpor xmm10, xmm7, xmm8 #55.5
vpor xmm11, xmm9, xmm10 #57.5
vpmovmskb eax, xmm11 #11.12
ret
And it even works on ppc64 with altivec - though definitely suboptimal
cmp3:
lwa 10,4(4)
lxvd2x 33,0,3
vspltisw 11,-1
lwa 9,8(4)
vspltisw 12,0
xxpermdi 33,33,33,2
lwa 8,0(4)
stw 10,-32(1)
addi 10,1,-80
stw 9,-16(1)
li 9,32
stw 8,-48(1)
lvewx 0,10,9
li 9,48
xxspltw 32,32,3
lvewx 13,10,9
li 9,64
vcmpequw 0,1,0
lvewx 10,10,9
xxsel 32,44,43,32
xxspltw 42,42,3
xxspltw 45,45,3
vcmpequw 13,1,13
vcmpequw 1,1,10
xxsel 45,44,43,45
xxsel 33,44,43,33
xxlor 32,32,45
xxlor 32,32,33
vsldoi 1,12,11,12
xxland 32,32,33
vcmpequw. 0,0,12
mfcr 3,2
rlwinm 3,3,25,1
cntlzw 3,3
srwi 3,3,5
blr
cmp4:
lwa 10,8(4)
lxvd2x 33,0,3
vspltisw 10,-1
lwa 9,12(4)
vspltisw 11,0
xxpermdi 33,33,33,2
lwa 7,0(4)
lwa 8,4(4)
stw 10,-32(1)
addi 10,1,-96
stw 9,-16(1)
li 9,32
stw 7,-64(1)
stw 8,-48(1)
lvewx 0,10,9
li 9,48
xxspltw 32,32,3
lvewx 13,10,9
li 9,64
xxspltw 45,45,3
vcmpequw 13,1,13
xxsel 44,43,42,45
lvewx 13,10,9
li 9,80
vcmpequw 0,1,0
xxspltw 45,45,3
xxsel 32,43,42,32
vcmpequw 13,1,13
xxlor 32,32,44
xxsel 45,43,42,45
lvewx 12,10,9
xxlor 32,32,45
xxspltw 44,44,3
vcmpequw 1,1,12
xxsel 33,43,42,33
xxlor 32,32,33
vcmpequw. 0,0,11
mfcr 3,2
rlwinm 3,3,25,1
cntlzw 3,3
srwi 3,3,5
blr
As you can see from the generated asm, there is still a little room for improvement, but it will compile on risc-v, mips, ppc and other architecture+compiler combinations that support vector extensions.
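A hypothetical usage example for the cmp3 above; note that the first pointer must refer to at least four readable ints (a full 128-bit vector is loaded from it), while only three are read through the second:
int a[4] = {1, 2, 3, 0};   // fourth element is padding, never compared
int b[4] = {7, 8, 2, 0};
int any_equal = cmp3(a, b) != 0;   // non-zero: a[1] == b[2]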

Neon optimization of interlaced YUYV to gray

I have the following C code that converts an interlaced webcam YUYV to gray:
void convert_yuyv_to_y(const void *src, char *dest) {
int x, y;
const char *Y;
char *gray;
//get only Y component for grayscale from (Y1)(U1,2)(Y2)(V1,2)
for (y = 0; y < CAM_HEIGHT; y++) {
Y = (const char *)src + (CAM_WIDTH * 2 * y);
gray = dest + (CAM_WIDTH * y);
for (x=0; x < CAM_WIDTH; x += 2) {
gray[x] = *Y;
Y += 2;
gray[x + 1] = *Y;
Y += 2;
}
}
}
Is there a way to optimize such function with some neon instructions?
Here is a starting point. From here you can do cache preloads, loop unrolling, etc. The best performance will happen when more NEON registers are involved to prevent data stalls.
.equ CAM_HEIGHT, 480 # fill in the correct values
.equ CAM_WIDTH, 640
#
# Call from C as convert_yuyv_to_y(const void *src, char *dest);
#
convert_yuyv_to_y:
mov r2,#CAM_HEIGHT
cvtyuyv_top_y:
mov r3,#CAM_WIDTH
cvtyuyv_top_x:
vld2.8 {d0,d1},[r0]! # assumes source width is a multiple of 8
vst1.8 {d0},[r1]! # work with 8 pixels at a time
subs r3,r3,#8 # x+=8
bgt cvtyuyv_top_x
subs r2,r2,#1 # y++
bgt cvtyuyv_top_y
bx lr
(Promoting my comment to answer)
The smallest number of instructions to de-interleave the data on the NEON architecture is achieved with the sequence:
vld2.8 { d0, d1 }, [r0]!
vst1.8 { d0 }, [r1]!
Here r0 is the source pointer, which advances by 16 each time and r1 is the destination pointer, which advances by 8.
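The same step with intrinsics, using the q-register form so 16 pixels are handled per iteration (a sketch assuming the total pixel count is a multiple of 16):
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void convert_yuyv_to_y_neon(const uint8_t *src, uint8_t *dest, size_t pixels) {
    for (size_t i = 0; i < pixels; i += 16) {
        uint8x16x2_t yuyv = vld2q_u8(src);   // de-interleave: val[0] = Y, val[1] = U/V
        vst1q_u8(dest, yuyv.val[0]);         // store only the luma bytes
        src  += 32;                          // 16 pixels * 2 bytes per pixel
        dest += 16;
    }
}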
Loop unrolling, the ability to load up to four registers at a time, and offsetting the registers by two can give a slightly higher maximum throughput, coupled with alignment to 16 bytes:
start:
vld4.8 { d0, d1, d2, d3 }, [r0:256]
subs r3, r3, #1
vld4.8 { d4, d5, d6, d7 }, [r1:256]
add r0, r0, #64
add r1, r0, #64
vst2.8 { d0, d2 }, [r2:256]!
vst2.8 { d4, d6 }, [r2:128]!
bgt start
(I can't remember if the format vstx.y {regs}, [rx, ro] exists -- here ro is an offset register that post-increments rx)
While memory transfer optimizations can be useful, it's still better to consider whether the copy can be skipped altogether or merged with some other calculation. This could also be the place to consider a planar pixel format, which would avoid the copying task entirely.
