Why is permute needed in parallel SIMD/SSE/AVX?

From my other question about "Using SIMD AVX SSE for tree traversal" I've got this code that I'm trying to benchmark. I haven't done anything with SIMD before, so I'm fairly new to this permutation stuff. First, let's see the code:
__m256i const perm_mask = _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0);
// compare the two halves of the cache line.
__m256i cmp1 = _mm256_load_si256(&node->m256[0]);
__m256i cmp2 = _mm256_load_si256(&node->m256[1]);
cmp1 = _mm256_cmpgt_epi32(cmp1, value); // PCMPGTD
cmp2 = _mm256_cmpgt_epi32(cmp2, value); // PCMPGTD
// merge the comparisons back together.
//
// a permute is required to get the pack results back into order
// because AVX-256 introduced that unfortunate two-lane interleave.
//
// alternately, you could pre-process your data to remove the need
// for the permute.
__m256i cmp = _mm256_packs_epi32(cmp1, cmp2); // PACKSSDW
cmp = _mm256_permutevar8x32_epi32(cmp, perm_mask); // PERMD
// finally create a move mask and count trailing
// zeroes to get an index to the next node.
unsigned mask = _mm256_movemask_epi8(cmp); // PMOVMSKB
return _tzcnt_u32(mask) / 2; // TZCNT
The author, Cory Nelson, tried to explain it with the comments. However, I'm not really getting how this permutation works and why it ends up "extracting" the wanted information from the result vector.
Could anybody help me understand how the permutation, movemask and TZCNT are used in this code, and what "packing/unpacking" means in this context? I'd be thankful for any resources you might have about it; Google isn't that helpful with this very specific topic.

Intel's instruction set manuals will be invaluable to your learning of SIMD. They explain in great detail what each of those instructions does.
"Packing" in SSE/AVX is basically a downcast and merge of two registers. PACKSSDW packs 32-bit signed ints from two registers into 16-bit signed ints in one register, and saturates the values (so values < -32768 will be set to -32768, and >32767 will be set to 32767)
A permute is a way of reordering the values in a register. Each value in the mask register specifies an index into the source. This is required because AVX256 "cheated" a little and processes most of its mixing instructions as two 128-bit "lanes".
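As a mental model of the permute itself, _mm256_permutevar8x32_epi32 (VPERMD) behaves like this scalar sketch (my code, for illustration only): output element i comes from src[idx[i] & 7], and the index may cross the 128-bit lane boundary, which is exactly what lets it undo the pack's interleave.
#include <stdint.h>
static void permutevar8x32_model(const int32_t src[8], const int32_t idx[8], int32_t dst[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = src[idx[i] & 7];   // lane-crossing dword gather from src
}
With the mask from the question (element order {0,1,4,5,2,3,6,7}, since _mm256_set_epi32 lists its arguments high to low), this moves the dwords holding the "A" results in front of those holding the "B" results.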
The 128-bit version of PACKSSDW performs this:
r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(b0)
r5 := SignedSaturate(b1)
r6 := SignedSaturate(b2)
r7 := SignedSaturate(b3)
You'd expect the 256-bit version to maintain the same natural ordering with all the "A"s first and the "B"s second, like this:
r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(a4)
r5 := SignedSaturate(a5)
r6 := SignedSaturate(a6)
r7 := SignedSaturate(a7)
r8 := SignedSaturate(b0)
r9 := SignedSaturate(b1)
r10 := SignedSaturate(b2)
r11 := SignedSaturate(b3)
r12 := SignedSaturate(b4)
r13 := SignedSaturate(b5)
r14 := SignedSaturate(b6)
r15 := SignedSaturate(b7)
But instead, here is what it actually does:
r0 := SignedSaturate(a0) // lane one, the low 128 bits.
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(b0)
r5 := SignedSaturate(b1)
r6 := SignedSaturate(b2)
r7 := SignedSaturate(b3)
r8 := SignedSaturate(a4) // lane two, the high 128 bits.
r9 := SignedSaturate(a5)
r10 := SignedSaturate(a6)
r11 := SignedSaturate(a7)
r12 := SignedSaturate(b4)
r13 := SignedSaturate(b5)
r14 := SignedSaturate(b6)
r15 := SignedSaturate(b7)
The result is that when comparing an array of neatly ordered values, the 128-bit version keeps them ordered while the 256-bit version will mix them. The permute puts them back into order.
As I alluded to in my post, you can get rid of the permute in this code by preprocessing your node's array with the inverse permutation, so that the "mixed" result of the 256-bit op comes out already in order:
void preprocess_avx2(bnode* const node)
{
__m256i const perm_mask = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);
__m256i *const middle = (__m256i*)&node->i32[4];
__m256i x = _mm256_loadu_si256(middle);
x = _mm256_permutevar8x32_epi32(x, perm_mask);
_mm256_storeu_si256(middle, x);
}
The ordering is important because of what it does next.
Each compare produces 32-bit elements that are all-ones or all-zero, and after packing you have 16 16-bit values that are either 0x0000 or 0xFFFF. You essentially only have 16 bits of information -- off or on for each value. PMOVMSKB treats the input as 32 8-bit values (bytes) and packs the high bit of each (which is all we need, since within an element every bit is the same) into a 32-bit int.
TZCNT counts the trailing zero bits in that int, which gives an index to the first position that has a set bit: the index of the first byte in that SIMD register that compared as greater-than.
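As a hypothetical worked example of those last two steps (the helper name is mine): each 16-bit compare result contributes two identical bits to the PMOVMSKB mask, so the first set bit sits at byte position 2*index, and dividing by 2 recovers the element index.
#include <stdint.h>
#include <immintrin.h>
static unsigned index_from_mask(uint32_t mask)
{
    // e.g. keys 0..2 compare "not greater", the rest "greater":
    // mask = 0xFFFFFFC0, tzcnt = 6, 6 / 2 = 3 -> first greater key is at index 3.
    return _tzcnt_u32(mask) / 2;
}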
(Fun fact: TZCNT is a Haswell improvement over the existing BSF instruction, and in fact shares an encoding with it. The only difference is that TZCNT has a defined register output when its input is 0 -- with BSF you'd need to branch.)

Related

Fastest way to find 16bit match in a 4 element short array?

I may confirm by using nanobench; today I don't feel clever and can't think of an easy way.
I have an array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However, since it's only 64 bits, I was wondering if there's a faster way?
First I thought maybe I could build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48), but then realized that neither an AND nor a SUB is the same as a compare, since the low 16 bits can affect the higher 16. Then I thought that instead of a plain loop I could write index = tzcnt((target==arr[0]?1:0) ... | (target==arr[3]?8:0)).
Can anyone think of something more clever? I suspect using the ternary method would give me best results since it's branchless?
For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.
But then you need to detect a contiguous run of 16 zero bits. Unlike with pcmpeqw, you'll have some zero bits in the other elements.
So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
There is yet a faster method: use hasless(v, 1), which is defined below; it works in 4 operations and requires no subsequent verification. It simplifies to
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.
This bithack was originally by Alan Mycroft in 1987.
So it could look like this (untested):
#include <stdint.h>
#include <string.h>
// returns 0 / non-zero status.
uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
{
uint64_t vneedle = 0x0001000100010001ULL * needle; // broadcast
uint64_t vbuf;
memcpy(&vbuf, haystack, sizeof(vbuf)); // aliasing-safe unaligned load
//static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));
uint64_t match = vbuf ^ vneedle;
uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
return any_zeros;
// unsigned matchpos = (unsigned)(_tzcnt_u64(any_zeros) >> 4); // see "Match position" below
}
Godbolt with GCC and clang, also including a SIMD intrinsics version.
# gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
# x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
# without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant
hasmatch_16in64:
movabs rax, 281479271743489 # 0x1000100010001
movzx edi, di # zero-extend to 64-bit
imul rdi, rax # vneedle
xor rdi, QWORD PTR [rsi] # match
# then the bithack
mov rdx, rdi
sub rdx, rax
andn rax, rdi, rdx # BMI1
movabs rdx, -9223231297218904064 # 0x8000800080008000
and rax, rdx
ret
Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.
AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.
Match position
If you need to know where the match is, I think that bithack has a 1 in the high bit of each zero byte or u16, and nowhere else. (The lowest-precedence / last operation is the bitwise AND involving 0x8000800080008000.)
So maybe tzcnt(any_zeros) >> 4 to go from bit-index to u16-index, rounding down. e.g. if the second one is zero, the tzcnt result will be 31. 31 >> 4 = 1.
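A minimal sketch of that extraction, assuming a match exists (note it needs the 64-bit tzcnt, because the flags for the third and fourth elements live in the upper 32 bits):
#include <stdint.h>
#include <immintrin.h>
// Returns which u16 element matched, given a non-zero result from hasmatch_16in64().
static unsigned matchpos_16in64(uint64_t any_zeros)
{
    return (unsigned)(_tzcnt_u64(any_zeros) >> 4);  // e.g. element 1 -> flag at bit 31 -> 31 >> 4 = 1
}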
If that doesn't work, then yeah, AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmpeqw / vpmovmskb / tzcnt will work well too, with smaller code-size and fewer uops, but maybe higher latency. Or maybe less. (tzcnt on the byte mask gives a byte offset; right-shift by 1 if you need the index of which short.)
Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.
If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.
#include <immintrin.h>
// note the unsigned function arg, not uint16_t;
// we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
{
#ifdef __AVX2__ // or higher
__m128i vneedle = _mm_set1_epi16(needle);
#else
__m128i vneedle = _mm_cvtsi32_si128(needle); // movd
vneedle = _mm_shufflelo_epi16(vneedle, 0); // broadcast to low half
#endif
__m128i vbuf = _mm_loadl_epi64((void*)haystack); // alignment and aliasing safe
unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
//return _tzcnt_u32(mask) >> 1;
return mask;
}
# clang expects narrow integer args to already be zero- or sign-extended to 32
hasmatch_SIMD:
movd xmm0, edi
pshuflw xmm0, xmm0, 0 # xmm0 = xmm0[0,0,0,0,4,5,6,7]
movq xmm1, qword ptr [rsi] # xmm1 = mem[0],zero
pcmpeqw xmm1, xmm0
pmovmskb eax, xmm1
ret
AVX-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.
With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)
There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Or else SIMD is an obvious win; checking two 4-element arrays at once in the low and high halves of a __m128i. So probably we do need to consider the front-end costs of setting up those constants.
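A rough sketch of that two-at-once idea (the function name and layout are mine, untested): load both 4-element arrays into one __m128i and split the 16-bit movemask into per-array halves.
#include <immintrin.h>
#include <stdint.h>
// Check two 4-element arrays against the same needle with one 128-bit compare.
static void hasmatch_2x4(unsigned needle, const uint16_t a[4], const uint16_t b[4],
                         unsigned *mask_a, unsigned *mask_b)
{
    __m128i vneedle = _mm_set1_epi16((short)needle);
    __m128i va = _mm_loadl_epi64((const __m128i*)a);   // a -> low 64 bits
    __m128i vb = _mm_loadl_epi64((const __m128i*)b);   // b -> low 64 bits
    __m128i vbuf = _mm_unpacklo_epi64(va, vb);          // [ a | b ]
    unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
    *mask_a = mask & 0xFF;    // bytes 0..7  = array a
    *mask_b = mask >> 8;      // bytes 8..15 = array b
}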
I didn't add up the latencies; it's probably a bit higher even on Intel CPUs which generally have good latency between integer and SIMD units.
GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.
Clang with -O2 or -O3 and GCC with -O3 compile a simple search loop into branchless instructions:
int indexOf(short target, short* arr) {
int index = -1;
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
index = i;
}
}
return index;
}
Demo
I doubt you can get much better without SIMD. In other words, write simple and understandable code to help the compiler produce efficient code.
Side note: for some reason, neither Clang nor GCC use conditional moves on this very similar code:
int indexOf(short target, short* arr) {
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
return i;
}
}
return -1;
}

Compare two 64 bit variables on 32 bit microcontroller

I have the following issue: I have two 64-bit variables and they have to be compared as quickly as possible; my microcontroller is only 32-bit.
My thought is that it is necessary to split each 64-bit variable into two 32-bit variables, like this:
uint64_t var = 0xAAFFFFFFABCDELL;
hiPart = (uint32_t)((var & 0xFFFFFFFF00000000LL) >> 32);
loPart = (uint32_t)(var & 0xFFFFFFFFLL);
and then compare the hiParts and loParts, but I am sure that this approach is slow and that there is a much better solution.
The first rule should be: write your program so that it is readable to a human.
When in doubt, don't assume anything, but measure it. Let's see what Godbolt gives us.
#include <stdint.h>
#include <stdbool.h>
bool foo(uint64_t a, uint64_t b) {
return a == b;
}
bool foo2(uint64_t a, uint64_t b) {
uint32_t ahiPart = (uint32_t)((a & 0xFFFFFFFF00000000ULL) >> 32);
uint32_t aloPart = (uint32_t)(a & 0xFFFFFFFFULL);
uint32_t bhiPart = (uint32_t)((b & 0xFFFFFFFF00000000ULL) >> 32);
uint32_t bloPart = (uint32_t)(b & 0xFFFFFFFFULL);
return ahiPart == bhiPart && aloPart == bloPart;
}
foo:
eor r1, r1, r3
eor r0, r0, r2
orr r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
foo2:
eor r1, r1, r3
eor r0, r0, r2
orr r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
As you can see, both result in exactly the same assembly code, but you decide which one is less error-prone and easier to read.
There was a time, some years ago, when you needed to do tricks to be smarter than the compiler. But in 99.999% of cases, the compiler will be smarter than you.
And your variables are unsigned. So use ULL instead of LL.
The fastest way is to let the compiler do it. Most compilers are much better than humans at micro-optimization.
uint64_t var = …, other_var = …;
if (var == other_var) …
There aren't many ways to go about it. Under the hood, the compiler will arrange to load the upper 32 bits and the lower 32 bits of each variable into registers, and compare the two registers that contain the upper 32 bits and the two registers that contain the lower 32 bits. The assembly code might look something like this:
load 32 bits from &var into r0
load 32 bits from &other_var into r1
if r0 != r1: goto different
load 32 bits from &var + 4 into r2
load 32 bits from &other_var + 4 into r3
if r2 != r3: goto different
// code for if-equal
different:
// code for if-not-equal
Here are some things the compiler knows better than you:
Which registers to use, based on the needs of the surrounding code.
Whether to reuse the same registers to compare the upper and lower parts, or to use different registers.
Whether to process one part and then the other (as above), or to load one variable then the other. The best order depends on the pressure on registers and on the memory access times and pipelining of the particular processor model.
If you work with a union, you can compare the high and low parts without any extra calculation:
typedef union
{
struct
{
uint32_t loPart;
uint32_t hiPart;
};
uint64_t complete;
} uint64T;
uint64T var;
var.complete = 0xAAFFFFFFABCDEULL;
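A hedged sketch of using it for the comparison itself (the function name is mine; note the anonymous struct members require C11 or a compiler extension, and the loPart/hiPart layout assumes a little-endian target):
#include <stdbool.h>
static bool equal64(const uint64T *a, const uint64T *b)
{
    // Compare the halves explicitly; compilers emit much the same code
    // as a plain 64-bit == (see the Godbolt output above).
    return a->hiPart == b->hiPart && a->loPart == b->loPart;
}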

Encoding 3 base-6 digits in 8 bits for unpacking performance

I'm looking for an efficient-to-unpack (in terms of small number of basic ALU ops in the generated code) way of encoding 3 base-6 digits (i.e. 3 numbers in the range [0,5]) in 8 bits. Only one is needed at a time, so approaches that need to decode all three in order to access one are probably not good unless the cost of decoding all three is very low.
The obvious method is of course:
x = b%6; // 8 insns
y = b/6%6; // 13 insns
z = b/36; // 5 insns
The instruction counts are measured on x86_64 with gcc>=4.8 which knows how to avoid divs.
Another method (using a different encoding) is:
b *= 6
x = b>>8;
b &= 255;
b *= 6
y = b>>8;
b &= 255;
b *= 6
z = b>>8;
This encoding has more than one representation for many tuples (it uses the whole 8bit range rather than just [0,215]) and appears more efficient if you want all 3 outputs, but wasteful if you only want one.
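For concreteness, here is that second scheme as a small C function (the wrapper is mine; it reads the byte as a base-6 fraction of the 0..255 range, roughly b = 256*(x/6 + y/36 + z/216)):
#include <stdint.h>
// Decode all three digits: each step scales the remaining fraction by 6
// and takes the integer part, masking back to 8 bits in between.
static void unpack3_base6(uint8_t b, unsigned out[3])
{
    unsigned t = (unsigned)b * 6;
    out[0] = t >> 8;            // x
    t = (uint8_t)t * 6u;
    out[1] = t >> 8;            // y
    t = (uint8_t)t * 6u;
    out[2] = t >> 8;            // z
}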
Are there better approaches?
Target language is C but I've tagged this assembly as well since answering requires some consideration of the instructions that would be generated.
As discussed in comments, a LUT would be excellent if it stays hot in cache. uint8_t LUT[3][256] would need the selector scaled by 256, which takes an extra instruction if it's not a compile-time constant. Scaling by 216 to pack the LUT better is only 1 or 2 instructions more expensive. struct3 LUT[216] is nice, where the struct has a 3-byte array member. On x86, this compiles extremely well in position-dependent code where the LUT base can be a 32-bit absolute as part of the addressing mode (if the table is static):
struct { uint8_t vals[3]; } LUT[216];
unsigned decode_LUT(uint8_t b, unsigned selector) {
return LUT[b].vals[selector];
}
gcc7 -O3 on Godbolt for x86-64 and AArch64
movzx edi, dil
mov esi, esi # zero-extension to 64-bit: goes away when inlining.
lea rax, LUT[rdi+rdi*2] # multiply by 3 and add the base
movzx eax, BYTE PTR [rax+rsi] # then index by selector
ret
Silly gcc used a 3-component LEA (3 cycle latency and runs on fewer ports) instead of using LUT as a disp32 for the actual load (no extra latency for an indexed addressing mode, I think).
This layout has the added advantage of locality if you ever need to decode multiple components of the same byte.
In PIC / PIE code, this costs 2 extra instructions, unfortunately:
movzx edi, dil
lea rax, LUT[rip] # RIP-relative LEA instead of absolute as part of another addressing mode
mov esi, esi
lea rdx, [rdi+rdi*2]
add rax, rdx
movzx eax, BYTE PTR [rax+rsi]
ret
But that's still cheap, and all the ALU instructions are single-cycle latency.
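The answer assumes the table already exists; filling it once, assuming the question's packing (b = x + 6*y + 36*z, selector 0/1/2 = x/y/z), could look like this, using the LUT declared above:
// One-time construction of the 216-entry table used by decode_LUT().
static void init_LUT(void)
{
    for (unsigned b = 0; b < 216; ++b) {
        LUT[b].vals[0] = (uint8_t)(b % 6);
        LUT[b].vals[1] = (uint8_t)(b / 6 % 6);
        LUT[b].vals[2] = (uint8_t)(b / 36);
    }
}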
Your 2nd ALU unpacking strategy is promising. I thought at first we could use a single 64-bit multiply to get b*6, b*6*6, and b*6*6*6 in different positions of the same 64-bit integer: b * ((6ULL*6*6<<32) + (36<<16) + 6).
But the upper byte of each multiply result does depend on masking back to 8-bit after each multiply by 6. (If you can think of a way to not require that, one multiply and shift would be very cheap, especially on 64-bit ISAs where the entire 64-bit multiply result is in one register.)
Still, x86 and ARM can multiply by 6 and mask in 3 cycles of latency, the same or better latency than a multiply, or less on Intel CPUs with zero-latency movzx r32, r8, if the compiler avoids using parts of the same register for movzx.
add eax, eax ; *2
lea eax, [rax + rax*2] ; *3
movzx ecx, al ; 0 cycle latency on Intel
.. repeat for next steps
ARM / AArch64 is similarly good, with add r0, r0, r0 lsl #1 for multiply by 3.
As a branchless way to select one of the three, you could consider storing (from ah / ch / ... to get the shift for free) to an array, then loading with the selector as the index. This costs store/reload latency (~5 cycles), but is cheap for throughput and avoids branch misses. (Possibly a 16-bit store and then a byte reload would be good, scaling the selector in the load address and adding 1 to get the high byte, saving an extract instruction before each store on ARM).
This is in fact what gcc emits if you write it this way:
unsigned decode_ALU(uint8_t b, unsigned selector) {
uint8_t decoded[3];
uint32_t tmp = b * 6;
decoded[0] = tmp >> 8;
tmp = 6 * (uint8_t)tmp;
decoded[1] = tmp >> 8;
tmp = 6 * (uint8_t)tmp;
decoded[2] = tmp >> 8;
return decoded[selector];
}
movzx edi, dil
mov esi, esi
lea eax, [rdi+rdi*2]
add eax, eax
mov BYTE PTR -3[rsp], ah # store high half of mul-by-6
movzx eax, al # costs 1 cycle: gcc doesn't know about zero-latency movzx?
lea eax, [rax+rax*2]
add eax, eax
mov BYTE PTR -2[rsp], ah
movzx eax, al
lea eax, [rax+rax*2]
shr eax, 7
mov BYTE PTR -1[rsp], al
movzx eax, BYTE PTR -3[rsp+rsi]
ret
The first store's data is ready 4 cycles after the input to the first movzx, or 5 if you include the extra 1c of latency for reading ah when it's not renamed separately on Intel HSW/SKL. The next 2 stores are 3 cycles apart.
So the total latency is ~10 cycles from b input to result output, if selector=0. Otherwise 13 or 16 cycles.
Measuring a number of different approaches in-place in the function that needs to do this, the practical answer is really boring: it doesn't matter. They're all running at about 50ns per call, and other work is dominating. So for my purposes, the approach that pollutes the cache and branch predictors the least is probably the best. That seems to be:
(b * (int[]){2048,342,57}[i] >> 11) % 6;
where b is the byte containing the packed values and i is the index of the value wanted. The magic constants 342 and 57 are just the multiplicative constants GCC generates for division by 6 and 36, respectively, scaled to a common shift of 11. The final %6 is spurious in the /36 case (i==2) but branching to avoid it does not seem worthwhile.
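Wrapped as a function for clarity (the name is mine):
// b is the packed byte (0..215); i selects the digit: 0 -> x, 1 -> y, 2 -> z.
static unsigned unpack_base6(unsigned b, unsigned i)
{
    static const unsigned mul[3] = {2048, 342, 57};   // ~2048/1, ~2048/6, ~2048/36
    return (b * mul[i] >> 11) % 6;
}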
On the other hand, if doing this same work in a context where there wasn't an interface constraint to have the surrounding function call overhead per lookup, I think an approach like Peter's would be preferable.

Adding two n-byte integers to produce an n-byte answer in 6502?

I'm having another issue with addition in 6502....
I am attempting to add two n-byte integers to produce an n-byte result. I'm not completely sure if I understand the 6502 chip as much as I should for this project so any feedback on my current code would be extremely helpful.
I know I am supposed to be using INX (increment the x register) and DEY (decrement the y register) but I am unsure of the placement of the opcodes.
Description:
Add two n-byte integers using absolute indexed addressing
The addends start at memory locations $xxxx, $yyyy, answer is at $zzzz
Byte length of the integers is at $AAAA (up to 256)
START = $0500
CLC
____
loop LDA $0400, x
ADC $0410, x
STA $0412, x
____
BNE loop
BRK
LDA, ADC, and STA are outside the loop (first time using loops in assembly)
EDIT:
Variables
A1 = $0600
B1 = $0700
B2 = $0800
Z1 = $0900
[START] = $0500
CLC 18
LDX AE
LDY A1 AC
loop: LDA B1, x BD
ADC B2, x 7D
STA Z1, x 9D
INX E8
DEY 88
BNE loop D0
;Adding two n-byte integers using absolute indexed addressing
;The addends start at memory locations $xxxx, $yyyy, answer is at $zzzz
;Byte length of the integers is at $AAAA (up to 256)
CLC
LDX #0 ; start at the beginning
LDY $AAAA ; load length into Y
loop: LDA $xxxx, X ; load first operand
ADC $yyyy, x ; add second operand
STA $zzzz, x ; store result
INX ; go on to next byte
DEY ; count how many are left
BNE loop ; if more, do more
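For reference, a C model of what that loop computes (illustration only; the names are mine): little-endian, byte-at-a-time addition where the carry out of each ADC feeds the next one.
#include <stdint.h>
#include <stddef.h>
static void add_nbytes(const uint8_t *a, const uint8_t *b, uint8_t *result, size_t n)
{
    unsigned carry = 0;                 // CLC before the loop
    for (size_t i = 0; i < n; ++i) {    // INX / DEY / BNE drive the loop
        unsigned sum = a[i] + b[i] + carry;
        result[i] = (uint8_t)sum;       // STA
        carry = sum >> 8;               // the carry ADC adds into the next byte
    }
}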

Extract non-zero values from an __m128i register with SSE

I have to extract non-zero values of an __m128i register.
For example I have a vector with eight unsigned shorts.
__m128i vector {40, 0, 22, 0, 0, 0, 0, 8}
I want to extract the 40, 22 and 8 with a minimal amount of SSE instructions.
The non-zero values will then be stored in an array of non zero values.
{40, 22, 8, more values from different vectors ... }
Is it possible to shuffle them or is there a good intrinsic to extract and store?
If you look at this paper, the authors describe how to use the _mm_cmpestrm instruction to do basically what you want. The core of their algorithm is this (which I've modified slightly to do what you want, instead of what they want):
__m128i res_v = _mm_cmpestrm(
vector,
8,
_mm_setzero_si128(),
8,
_SIDD_UWORD_OPS|_SIDD_CMP_EQUAL_ANY|_SIDD_BIT_MASK|_SIDD_NEGATIVE_POLARITY);
int r = _mm_extract_epi32(res_v, 0);
__m128i p = _mm_shuffle_epi8(vector, sh_mask[r]);
If you build the look-up table sh_mask as described in the paper, then p should have the non-zero elements (without any re-ordering) followed by the zero elements. The number of bits set in r will tell you the number of non-zero elements.
_mm_cmpestrm is in SSE4, unfortunately.
Based on anjruu's answer, here's an SSSE3 version that has not been tested in any way:
; xmm0 = input
pxor xmm1, xmm1
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
shl eax, 4
pshufb xmm0, [table + eax]
The table is different, of course, but not that hard to work out; just keep in mind that the index is "inverted", e.g. index 0 corresponds to having no zeros and 0xFFFF corresponds to all zeros.
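If it helps, here is an untested sketch of how one 16-byte entry of that table could be generated for a given pcmpeqb/pmovmskb result (an element is dropped only when both of its bytes compared equal to zero):
#include <stdint.h>
// zero_mask: the 16-bit pmovmskb result (bit set = that byte of the input was 0).
// ctrl: the pshufb control that left-packs the non-zero 16-bit elements.
static void make_shuffle_entry(unsigned zero_mask, uint8_t ctrl[16])
{
    unsigned out = 0;
    for (unsigned elem = 0; elem < 8; ++elem) {
        unsigned pair = (zero_mask >> (2 * elem)) & 3;  // the element's two byte bits
        if (pair != 3) {                                // keep unless the whole u16 was zero
            ctrl[out++] = (uint8_t)(2 * elem);          // low byte
            ctrl[out++] = (uint8_t)(2 * elem + 1);      // high byte
        }
    }
    while (out < 16)
        ctrl[out++] = 0x80;                             // top bit set -> pshufb writes a 0
}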
