64bit/32bit division faster algorithm for ARM / NEON? - c

I am working on code that performs a 64-bit by 32-bit fixed-point division in two places, with the result taken in 32 bits. Together these two places account for more than 20% of my total run time, so I feel that if I could remove the 64-bit division, I could optimize the code well. NEON has some 64-bit instructions. Can anyone suggest a faster routine to resolve this bottleneck?
Or, if I could express the 64-bit/32-bit division in terms of 32-bit/32-bit divisions in C, that would also be fine.
If anyone has an idea, could you please help me out?

I did a lot of fixed-point arithmetic in the past and did a lot of research looking for fast 64/32 bit divisions myself. If you google for 'ARM division' you will find tons of great links and discussion about this issue.
The best solution for the ARM architecture, where even a 32-bit division may not be available in hardware, is here:
This assembly code is very old, and your assembler may not understand the syntax of it. It is however worth porting the code to your toolchain. It is the fastest division code for your special case I've seen so far, and trust me: I've benchmarked them all :-)
Last time I did that (about 5 years ago, for CortexA8) this code was about 10 times faster than what the compiler generated.
This code doesn't use NEON. A NEON port would be interesting. Not sure if it will improve the performance much though.
I found the code with assembler ported to GAS (GNU Toolchain). This code is working and tested:
.section ".text"
.global udiv64
udiv64:
    adds r0,r0,r0
    adc r1,r1,r1
    .rept 31
    cmp r1,r2
    subcs r1,r1,r2
    adcs r0,r0,r0
    adc r1,r1,r1
    .endr
    cmp r1,r2
    subcs r1,r1,r2
    adcs r0,r0,r0
    bx lr
#include <stdint.h>
extern "C" uint32_t udiv64 (uint32_t a, uint32_t b, uint32_t c);
int32_t fixdiv24 (int32_t a, int32_t b)
{
    /* calculate (a<<24)/b with 64 bit intermediate result */
    int q;
    int sign = (a^b) < 0; /* different signs */
    uint32_t l,h;
    a = a<0 ? -a:a;
    b = b<0 ? -b:b;
    l = ((uint32_t)a << 24); /* low word of a<<24 */
    h = ((uint32_t)a >> 8);  /* high word of a<<24 */
    q = udiv64 (l,h,b);
    if (sign) q = -q;
    return q;
}
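For readers who want to check the assembly against something readable, here is a minimal C model of the same shift-and-subtract idea. The name udiv64_model is made up for this sketch, and it assumes hi < d so that the quotient fits in 32 bits. Unlike the assembly, it also tracks the bit shifted out of the remainder, so it is a reference for testing rather than a line-by-line translation.

```c
#include <stdint.h>

/* Hypothetical reference model: divides the 64-bit value (hi:lo) by d and
   returns the 32-bit quotient. Assumes hi < d so the quotient fits. */
uint32_t udiv64_model(uint32_t lo, uint32_t hi, uint32_t d)
{
    uint32_t q = 0;
    for (int i = 0; i < 32; ++i) {
        uint32_t carry = hi >> 31;       /* bit shifted out of the remainder */
        hi = (hi << 1) | (lo >> 31);     /* shift (hi:lo) left by one bit */
        lo <<= 1;
        q <<= 1;
        if (carry || hi >= d) {          /* remainder >= divisor: subtract */
            hi -= d;
            q |= 1;                      /* record a 1 quotient bit */
        }
    }
    return q;
}
```

Comparing udiv64_model(l, h, b) with the result of the assembly routine over a range of inputs is a cheap way to confirm a port to a new toolchain.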


Working inline assembly in C for bit parity?

I'm trying to compute the bit parity of a large number of uint64's. By bit parity I mean a function that accepts a uint64 and outputs 0 if the number of set bits is even, and 1 otherwise.
Currently I'm using the following function (by @Troyseph, found here):
uint parity64(uint64 n){
    n ^= n >> 1;
    n ^= n >> 2;
    n = (n & 0x1111111111111111) * 0x1111111111111111;
    return (n >> 60) & 1;
}
The same SO page has the following assembly routine (by @papadp):
; bool CheckParity(size_t Result)
CheckParity PROC
    mov rax, 0
    add rcx, 0
    jnp jmp_over
    mov rax, 1
jmp_over:
    ret
CheckParity ENDP
which takes advantage of the machine's parity flag. But I cannot get it to work with my C program (I know next to no assembly).
Question. How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
(I'm using GCC with 64-bit Ubuntu 14 on an Intel Xeon Haswell)
In case it's of any help, the parity64() function is called inside the following routine:
uint bindot(uint64* a, uint64* b, uint64 entries){
    uint parity = 0;
    for(uint i=0; i<entries; ++i)
        parity ^= parity64(a[i] & b[i]); // Running sum!
    return parity;
}
(This is supposed to be the "dot product" of two vectors over the field Z/2Z, aka. GF(2).)
This may sound a bit harsh, but I believe it needs to be said. Please don't take it personally; I don't mean it as an insult, especially since you already admitted that you "know next to no assembly." But if you think code like this:
CheckParity PROC
    mov rax, 0
    add rcx, 0
    jnp jmp_over
    mov rax, 1
jmp_over:
    ret
CheckParity ENDP
will beat what a C compiler generates, then you really have no business using inline assembly. In just those 5 lines of code, I see 2 instructions that are glaringly sub-optimal. It could be optimized by just rewriting it slightly:
    xor eax, eax
    test ecx, ecx ; logically, should use RCX, but see below for behavior of PF
    jnp jmp_over
    mov eax, 1 ; or possibly even "inc eax"; would need to verify
jmp_over:
    ret
Or, if you have random input values that are likely to foil the branch predictor (i.e., there is no predictable pattern to the parity of the input values), then it would be faster yet to remove the branch, writing it as:
    xor eax, eax
    test ecx, ecx
    setp al
    ret
Or perhaps the equivalent (which will be faster on certain processors, but not necessarily all):
    xor eax, eax
    test ecx, ecx
    mov ecx, 1
    cmovp eax, ecx
    ret
And these are just the improvements I could see off the top of my head, given my existing knowledge of the x86 ISA and previous benchmarks that I have conducted. But lest anyone be fooled, this is undoubtedly not the fastest code, because (borrowing from Michael Abrash), "there ain't no such thing as the fastest code"—someone can virtually always make it faster yet.
There are enough problems with using inline assembly when you're an expert assembly-language programmer and a wizard when it comes to the intricacies of the x86 ISA. Optimizers are pretty darn good nowadays, which means it's hard enough for a true guru to produce better code (though certainly not impossible). It also takes trustworthy benchmarks that will verify your assumptions and confirm that your optimized inline assembly is actually faster. Never commit yourself to using inline assembly to outsmart the compiler's optimizer without running a good benchmark. I see no evidence in your question that you've done anything like this. I'm speculating here, but it looks like you saw that the code was written in assembly and assumed that meant it would be faster. That is rarely the case. C compilers ultimately emit assembly language code, too, and it is often more optimal than what us humans are capable of producing, given a finite amount of time and resources, much less limited expertise.
In this particular case, there is a notion that inline assembly will be faster than the C compiler's output, since the C compiler won't be able to intelligently use the x86 architecture's built-in parity flag (PF) to its benefit. And you might be right, but it's a pretty shaky assumption, far from universalizable. As I've said, optimizing compilers are pretty smart nowadays, and they do optimize to a particular architecture (assuming you specify the right options), so it would not at all surprise me that an optimizer would emit code that used PF. You'd have to look at the disassembly to see for sure.
As an example of what I mean, consider the highly specialized BSWAP instruction that x86 provides. You might naïvely think that inline assembly would be required to take advantage of it, but it isn't. The following C code compiles to a BSWAP instruction on almost all major compilers:
uint32 SwapBytes(uint32 x)
{
    return ((x << 24) & 0xff000000 ) |
           ((x << 8) & 0x00ff0000 ) |
           ((x >> 8) & 0x0000ff00 ) |
           ((x >> 24) & 0x000000ff );
}
The performance will be equivalent, if not better, because the optimizer has more knowledge about what the code does. In fact, a major benefit this form has over inline assembly is that the compiler can perform constant folding with this code (i.e., when called with a compile-time constant). Plus, the code is more readable (at least, to a C programmer), much less error-prone, and considerably easier to maintain than if you'd used inline assembly. Oh, and did I mention it's reasonably portable if you ever wanted to target an architecture other than x86?
I know I'm making a big deal of this, and I want you to understand that I say this as someone who enjoys the challenge of writing highly-tuned assembly code that beats the compiler's optimizer in performance. But every time I do it, it's just that: a challenge, which comes with sacrifices. It isn't a panacea, and you need to remember to check your assumptions, including:
Is this code actually a bottleneck in my application, such that optimizing it would even make any perceptible difference?
Is the optimizer actually emitting sub-optimal machine language instructions for the code that I have written?
Am I wrong in what I naïvely think is sub-optimal? Maybe the optimizer knows more than I do about the target architecture, and what looks like slow or sub-optimal code is actually faster. (Remember that less code is not necessarily faster.)
Have I tested it in a meaningful, real-world benchmark, and proven that the compiler-generated code is slow and that my inline assembly is actually faster?
Is there absolutely no way that I can tweak the C code to persuade the optimizer to emit better machine code that is close, equal to, or even superior to the performance of my inline assembly?
In an attempt to answer some of these questions, I set up a little benchmark. (Using MSVC, because that's what I have handy; if you're targeting GCC, it's best to use that compiler, but we can still get a general idea. I use and recommend Google's benchmarking library.) And I immediately ran into problems. See, I first run my benchmarks in "debugging" mode, with assertions compiled in that verify that my "tweaked"/"optimized" code is actually producing the same results for all test cases as the original code (that is presumably known to be working/correct). In this case, an assertion immediately fired. It turns out that the CheckParity routine written in assembly language does not return identical results to the parity64 routine written in C! Uh-oh. Well, that's another bullet we need to add to the above list:
Have I ensured that my "optimized" code is returning the correct results?
This one is especially critical, because it's easy to make something faster if you also make it wrong. :-) I jest, but not entirely, because I've done this many times in the pursuit of faster code.
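To make that last check concrete, here is a minimal sketch of the kind of validation harness I mean. Everything in it is an assumption for illustration: the function names, the xorshift generator and the seed are invented here, and the "candidate" is simply the multiply-based routine from the question.

```c
#include <stdint.h>

/* Reference: flips the parity once per set bit (slow but easy to trust). */
unsigned parity_reference(uint64_t n)
{
    unsigned parity = 0;
    while (n) {
        parity ^= 1u;
        n &= n - 1;              /* clear the lowest set bit */
    }
    return parity;
}

/* Candidate under test: the multiply-based routine from the question. */
unsigned parity_candidate(uint64_t n)
{
    n ^= n >> 1;
    n ^= n >> 2;
    n = (n & 0x1111111111111111ULL) * 0x1111111111111111ULL;
    return (unsigned)((n >> 60) & 1);
}

/* Hypothetical helper: xorshift64 generator; the state must be nonzero. */
uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Counts the inputs on which the two implementations disagree. */
unsigned long count_mismatches(unsigned (*candidate)(uint64_t),
                               unsigned (*reference)(uint64_t),
                               unsigned long trials)
{
    uint64_t state = 0x9E3779B97F4A7C15ULL;   /* arbitrary nonzero seed */
    unsigned long mismatches = 0;
    for (unsigned long i = 0; i < trials; ++i) {
        uint64_t n = next_random(&state);
        if (candidate(n) != reference(n))
            ++mismatches;
    }
    return mismatches;
}
```

A call such as count_mismatches(parity_candidate, parity_reference, 100000) should return 0 before any timing numbers are worth looking at.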
I believe Michael Petch has already pointed out the reason for the discrepancy: in the x86 implementation, the parity flag (PF) only concerns itself with the bits in the low byte, not the entire value. If that's all you need, then great. But even then, we can go back to the C code and further optimize it to do less work, which will make it faster—perhaps faster than the assembly code, eliminating the one advantage that inline assembly ever had.
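The discrepancy is easy to reproduce in plain C. The sketch below uses two made-up helper names: one models the information PF carries (the parity of the low byte only; note that PF itself is set for even parity, while these helpers return 1 for odd), and the other folds the upper bytes down first so that all 64 bits contribute.

```c
#include <stdint.h>

/* Parity of the low 8 bits only; returns 1 for an odd number of set bits. */
unsigned parity_of_low_byte(uint64_t n)
{
    unsigned b = (unsigned)(n & 0xFF);
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1u;
}

/* Parity of all 64 bits: XOR-fold the upper bytes into the low byte first. */
unsigned parity_of_all_bits(uint64_t n)
{
    n ^= n >> 32;
    n ^= n >> 16;
    n ^= n >> 8;
    return parity_of_low_byte(n);
}
```

For an input such as 0x100, parity_of_low_byte returns 0 while parity_of_all_bits returns 1, which is exactly the kind of mismatch the assertion caught.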
For now, let's assume that you need the parity of the full value, since that's the original implementation you had that was working, and you're just trying to make it faster without changing its behavior. Thus, we need to fix the assembly code's logic before we can even proceed with meaningfully benchmarking it. Fortunately, since I am writing this answer late, Ajay Brahmakshatriya (with collaboration from others) has already done that work, saving me the extra effort.
…except, not quite. When I first drafted this answer, my benchmark revealed that draft 9 of his "tweaked" code still did not produce the same result as the original C function, so it's unsuitable according to our test cases. You say in a comment that his code "works" for you, which means either (A) the original C code was doing extra work, making it needlessly slow, meaning that you can probably tweak it to beat the inline assembly at its own game, or worse, (B) you have insufficient test cases and the new "optimized" code is actually a bug lying in wait. Since that time, Ped7g suggested a couple of fixes, which both fixed the bug causing the incorrect result to be returned, and further improved the code. The amount of input required here, and the number of drafts that he has gone through, should serve as testament to the difficulty of writing correct inline assembly to beat the compiler. But we're not even done yet! His inline assembly remains incorrectly written. SETcc instructions require an 8-bit register as their operand, but his code doesn't use a register specifier to request that, meaning that the code either won't compile (because Clang is smart enough to detect this error) or will compile on GCC but won't execute properly because that instruction has an invalid operand.
Have I convinced you about the importance of testing yet? I'll take it on faith, and move on to the benchmarking part. The benchmark results use the final draft of Ajay's code, with Ped7g's improvements, and my additional tweaks. I also compare some of the other solutions from that question you linked, modified for 64-bit integers, plus a couple of my own invention. Here are my benchmark results (mobile Haswell i7-4850HQ):
Benchmark                        Time      CPU   Iterations
-----------------------------------------------------------
Naive                           36 ns    36 ns     19478261
OriginalCCode                    4 ns     4 ns    194782609
Ajay_Brahmakshatriya_Tweaked     4 ns     4 ns    194782609
Shreyas_Shivalkar               37 ns    37 ns     17920000
TypeIA                           5 ns     5 ns    154482759
TypeIA_Tweaked                   4 ns     4 ns    160000000
has_even_parity                227 ns   229 ns      3200000
has_even_parity_Tweaked         36 ns    36 ns     19478261
GCC_builtin_parityll             4 ns     4 ns    186666667
PopCount                         3 ns     3 ns    248888889
PopCount_Downlevel               5 ns     5 ns    100000000
Now, keep in mind that these are for randomly-generated 64-bit input values, which disrupts branch prediction. If your input values are biased in a predictable way, either towards parity or non-parity, then the branch predictor will work for you, rather than against you, and certain approaches may be faster. This underscores the importance of benchmarking against data that simulates real-world use cases. (That said, when I write general library functions, I tend to optimize for random inputs, balancing size and speed.)
Notice how the original C function compares to the others. I'm going to make the claim that optimizing it any further is probably a big fat waste of time. So hopefully you learned something more general from this answer, rather than just scrolled down to copy-paste the code snippets. :-)
The Naive function is a completely unoptimized sanity check to determine the parity, taken from here. I used it to validate even your original C code, and also to provide a baseline for the benchmarks. Since it loops once for every set bit, it is relatively slow, as expected:
unsigned int Naive(uint64 n)
{
    bool parity = false;
    while (n)
    {
        parity = !parity;
        n &= (n - 1);
    }
    return parity;
}
OriginalCCode is exactly what it sounds like—it's the original C code that you had, as shown in the question. Notice how it posts up at exactly the same time as the tweaked/corrected version of Ajay Brahmakshatriya's inline assembly code! Now, since I ran this benchmark in MSVC, which doesn't support inline assembly for 64-bit builds, I had to use an external assembly module containing the function, and call it from there, which introduced some additional overhead. With GCC's inline assembly, the compiler probably would have been able to inline the code, thus eliding a function call. So on GCC, you might see the inline-assembly version be up to a nanosecond faster (or maybe not). Is that worth it? You be the judge. For reference, this is the code I tested for Ajay_Brahmakshatriya_Tweaked:
Ajay_Brahmakshatriya_Tweaked PROC
    mov rax, rcx ; Windows 64-bit calling convention passes parameter in RCX (System V uses RDI)
    shr rax, 32
    xor rcx, rax
    mov rax, rcx
    shr rax, 16
    xor rcx, rax
    mov rax, rcx
    shr rax, 8
    xor eax, ecx ; Ped7g's TEST is redundant; XOR already sets PF
    setnp al
    movzx eax, al
    ret
Ajay_Brahmakshatriya_Tweaked ENDP
The function named Shreyas_Shivalkar is from his answer here, which is just a variation on the loop-through-each-bit theme, and is, in keeping with expectations, slow:
Shreyas_Shivalkar PROC
    ; unsigned int parity = 0;
    ; while (x != 0)
    ; {
    ;     parity ^= x;
    ;     x >>= 1;
    ; }
    ; return (parity & 0x1);
    xor eax, eax
    test rcx, rcx
    je SHORT Finished
Process:
    xor eax, ecx
    shr rcx, 1
    jne SHORT Process
Finished:
    and eax, 1
    ret
Shreyas_Shivalkar ENDP
TypeIA and TypeIA_Tweaked are the code from this answer, modified to support 64-bit values, and my tweaked version. They parallelize the operation, resulting in a significant speed improvement over the loop-through-each-bit strategy. The "tweaked" version is based on an optimization originally suggested by Mathew Hendry to Sean Eron Anderson's Bit Twiddling Hacks, and does net us a tiny speed-up over the original.
unsigned int TypeIA(uint64 n)
{
    n ^= n >> 32;
    n ^= n >> 16;
    n ^= n >> 8;
    n ^= n >> 4;
    n ^= n >> 2;
    n ^= n >> 1;
    return !((~n) & 1);
}
unsigned int TypeIA_Tweaked(uint64 n)
{
    n ^= n >> 32;
    n ^= n >> 16;
    n ^= n >> 8;
    n ^= n >> 4;
    n &= 0xf;
    return ((0x6996 >> n) & 1);
}
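In case the 0x6996 constant looks like magic: it is a 16-entry lookup table packed into one integer, where bit i holds the parity of the 4-bit value i. The following sketch (the function name is invented for illustration) rebuilds the constant from that definition.

```c
#include <stdint.h>

/* Builds a 16-bit table whose bit i is the parity of the 4-bit value i. */
uint16_t build_nibble_parity_table(void)
{
    uint16_t table = 0;
    for (unsigned i = 0; i < 16; ++i) {
        /* XOR of the four bits of i: 1 when i has an odd number of set bits */
        unsigned p = (i ^ (i >> 1) ^ (i >> 2) ^ (i >> 3)) & 1u;
        table |= (uint16_t)(p << i);
    }
    return table;
}
```

Once the value has been folded down to a single nibble, shifting the table right by that nibble and masking with 1 replaces the last two fold steps with one table lookup.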
has_even_parity is based on the accepted answer to that question, modified to support 64-bit values. I knew this would be slow, since it's yet another loop-through-each-bit strategy, but obviously someone thought it was a good approach. It's interesting to see just how slow it actually is, even compared to what I termed the "naïve" approach, which does essentially the same thing, but faster, with less-complicated code.
unsigned int has_even_parity(uint64 n)
{
    uint64 count = 0;
    uint64 b = 1;
    for (uint64 i = 0; i < 64; ++i)
    {
        if (n & (b << i)) { ++count; }
    }
    return (count % 2);
}
has_even_parity_Tweaked is an alternate version of the above that saves a branch by taking advantage of the fact that Boolean values are implicitly convertible into 0 and 1. It is substantially faster than the original, clocking in at a time comparable to the "naïve" approach:
unsigned int has_even_parity_Tweaked(uint64 n)
{
    uint64 count = 0;
    uint64 b = 1;
    for (uint64 i = 0; i < 64; ++i)
    {
        count += static_cast<int>(static_cast<bool>(n & (b << i)));
    }
    return (count % 2);
}
Now we get into the good stuff. The function GCC_builtin_parityll consists of the assembly code that GCC would emit if you used its __builtin_parityll intrinsic. Several others have suggested that you use this intrinsic, and I must echo their endorsement. Its performance is on par with the best we've seen so far, and it has a couple of additional advantages: (1) it keeps the code simple and readable (simpler than the C version); (2) it is portable to different architectures, and can be expected to remain fast there, too; (3) as GCC improves its implementation, your code may get faster with a simple recompile. You get all the benefits of inline assembly, without any of the drawbacks.
GCC_builtin_parityll PROC ; GCC's __builtin_parityll
    mov edx, ecx
    shr rcx, 32
    xor edx, ecx
    mov eax, edx
    shr edx, 16
    xor eax, edx
    xor al, ah
    setnp al
    movzx eax, al
    ret
GCC_builtin_parityll ENDP
PopCount is an optimized implementation of my own invention. To come up with this, I went back and considered what we were actually trying to do. The definition of "parity" is an even number of set bits. Therefore, it can be calculated simply by counting the number of set bits and testing to see if that count is even or odd. That's two logical operations. As luck would have it, on recent generations of x86 processors (Intel Nehalem or AMD Barcelona, and newer), there is an instruction that counts the number of set bits—POPCNT (population count, or Hamming weight)—which allows us to write assembly code that does this in two operations.
(Okay, actually three instructions, because there is a bug in the implementation of POPCNT on certain microarchitectures that creates a false dependency on its destination register, and to ensure we get maximum throughput from the code, we need to break this dependency by pre-clearing the destination register. Fortunately, this is a very cheap operation, one that can generally be handled for "free" by register renaming.)
PopCount PROC
    xor eax, eax ; break false dependency
    popcnt rax, rcx
    and eax, 1
    ret
PopCount ENDP
In fact, as it turns out, GCC knows to emit exactly this code for the __builtin_parityll intrinsic when you target a microarchitecture that supports POPCNT (otherwise, it uses the fallback implementation shown below). As you can see from the benchmarks, this is the fastest code yet. It isn't a major difference, so it's unlikely to matter unless you're doing this repeatedly within a tight loop, but it is a measurable difference and presumably you wouldn't be optimizing this so heavily unless your profiler indicated that this was a hot-spot.
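If you would rather stay in C, the same count-then-test idea can be sketched with a compiler builtin. This assumes GCC or Clang (__builtin_popcountll is their builtin, not standard C), and the function name is made up for illustration. Whether a POPCNT instruction is actually emitted depends on the target options, for example -mpopcnt or -march=haswell; without them the compiler falls back to a software sequence.

```c
#include <stdint.h>

/* Parity via population count: 1 if n has an odd number of set bits.
   Assumes GCC or Clang, which provide __builtin_popcountll. */
unsigned parity_from_popcount(uint64_t n)
{
    return (unsigned)__builtin_popcountll(n) & 1u;
}
```

Checking the disassembly (gcc -S) is the only way to be sure which code path your build received.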
But the POPCNT instruction does have the drawback of not being available on older processors, so I also measured a "fallback" version of the code that does a population count with a sequence of universally-supported instructions. That is the PopCount_Downlevel function, taken from my private library, originally adapted from this answer and other sources.
PopCount_Downlevel PROC
    mov rax, rcx
    shr rax, 1
    mov rdx, 5555555555555555h
    and rax, rdx
    sub rcx, rax
    mov rax, 3333333333333333h
    mov rdx, rcx
    and rcx, rax
    shr rdx, 2
    and rdx, rax
    add rdx, rcx
    mov rcx, 0F0F0F0F0F0F0F0Fh
    mov rax, rdx
    shr rax, 4
    add rax, rdx
    mov rdx, 0101010101010101h
    and rax, rcx
    imul rax, rdx
    shr rax, 56
    and eax, 1
    ret
PopCount_Downlevel ENDP
As you can see from the benchmarks, all of the bit-twiddling instructions that are required here exact a cost in performance. It is slower than POPCNT, but supported on all systems and still reasonably quick. If you needed a bit count anyway, this would be the best solution, especially since it can be written in pure C without the need to resort to inline assembly, potentially yielding even more speed:
unsigned int PopCount_Downlevel(uint64 n)
{
    uint64 temp = n - ((n >> 1) & 0x5555555555555555ULL);
    temp = (temp & 0x3333333333333333ULL) + ((temp >> 2) & 0x3333333333333333ULL);
    temp = (temp + (temp >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    temp = (temp * 0x0101010101010101ULL) >> 56;
    return (temp & 1);
}
But run your own benchmarks to see if you wouldn't be better off with one of the other implementations, like OriginalCCode, which simplifies the operation and thus requires fewer total instructions. Fun fact: Intel's compiler (ICC) always uses a population count-based algorithm to implement __builtin_parityll; it emits a POPCNT instruction if the target architecture supports it, or otherwise, it simulates it using essentially the same code as I've shown here.
Or, better yet, just forget the whole complicated mess and let your compiler deal with it. That's what built-ins are for, and there's one for precisely this purpose.
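As a rough sketch of what "let the compiler deal with it" could look like for the routine in the question: the names below are invented for illustration, and the code assumes GCC or Clang, since __builtin_parityll is a compiler builtin. Because parity is linear over XOR, the loop can XOR the ANDed words together and take the parity once at the end, rather than once per element.

```c
#include <stddef.h>
#include <stdint.h>

/* Parity of a 64-bit value via the compiler builtin (GCC/Clang assumed). */
unsigned parity64_builtin(uint64_t n)
{
    return (unsigned)__builtin_parityll(n);
}

/* Dot product over GF(2): accumulate with XOR, take the parity once. */
unsigned bindot_builtin(const uint64_t *a, const uint64_t *b, size_t entries)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < entries; ++i)
        acc ^= a[i] & b[i];
    return parity64_builtin(acc);
}
```

With this shape, the parity computation runs once per call, so its exact cost matters far less than the memory traffic of the loop; benchmark it against your real data before tuning further.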
Because C sucks when handling bit operations, I suggest using gcc built in functions, in this case __builtin_parityl(). See:
You will have to use extended inline assembly (which is a gcc extension) to get the similar effect.
Your parity64 function can be changed as follows -
uint parity64_unsafe_and_broken(uint64 n){
    uint result = 0;
    __asm__("addq $0, %0" : : "r"(n) :);
    // editor's note: compiler-generated instructions here can destroy EFLAGS
    // Don't depend on FLAGS / regs surviving between asm statements
    // also, jumping out of an asm statement safely requires asm goto
    __asm__("jnp 1f");
    __asm__("movl $1, %0" : "=r"(result) : : );
    __asm__("1:");
    return result;
}
But as commented by @MichaelPetch, the parity flag is computed only on the lower 8 bits. So this will work for you only if n is less than 256. For bigger numbers you will have to use the code you mentioned in your question.
To get it working for 64 bits, you can collapse the parity of the 64-bit integer into a single byte by doing
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
This code will have to be just at the start of the function before the assembly.
You will have to check how it affects the performance.
The most optimized I could get it is
uint parity64(uint64 n){
    unsigned char result = 0;
    n = (n >> 32) ^ n;
    n = (n >> 16) ^ n;
    n = (n >> 8) ^ n;
    __asm__("test %1, %1 \n\t"
            "setnp %0" /* PF is set for even parity, so setnp yields 1 for odd */
            : "+r"(result)
            : "r"(n)
            : "cc");
    return result;
}
How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
This is an XY problem... You think you need to inline that assembly to gain from its benefits, so you asked about how to inline it... but you don't need to inline it.
You shouldn't include assembly into your C source code, because in this case you don't need to, and the better alternative (in terms of portability and maintainability) is to keep the two pieces of source code separate, compile them separately and use the linker to link them.
In parity64.c you should have your portable version (with a wrapper named bool CheckParity(size_t result)), which you can default to in non-x86/64 situations.
You can compile this to an object file like so: gcc -c parity64.c -o parity64.o
... and then link the object code generated from assembly, with the C code: gcc bindot.c parity64.o -o bindot
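As a sketch of what that portable parity64.c might contain (an assumption on my part: it simply wraps the multiply-based routine from the question behind the CheckParity name, and returns true for an odd number of set bits):

```c
/* parity64.c: portable fallback, a sketch of the layout suggested above */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool CheckParity(size_t result)
{
    uint64_t n = (uint64_t)result;
    n ^= n >> 1;                 /* bit 0 of each nibble will hold ... */
    n ^= n >> 2;                 /* ... the parity of that nibble */
    n = (n & 0x1111111111111111ULL) * 0x1111111111111111ULL;
    return ((n >> 60) & 1) != 0; /* low bit of the sum of nibble parities */
}
```

Any of the alternative object files discussed here can replace this one at link time, as long as they export the same CheckParity symbol with the same meaning.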
In parity64_x86.s you might have the following assembly code from your question:
; bool CheckParity(size_t Result)
CheckParity PROC
    mov rax, 0
    add rcx, 0
    jnp jmp_over
    mov rax, 1
jmp_over:
    ret
CheckParity ENDP
You can assemble this into an alternative parity64.o object file using gcc with this command: gcc -c parity64_x86.s -o parity64.o (note that the listing above is in MASM syntax, so it would first need porting to GNU assembler syntax)
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
Similarly, if you wanted to use __builtin_parityl instead (as suggested by hdantes' answer), you could (and should) once again keep that code separate (in the same place you keep other gcc/x86 optimisations) from your portable code. In parity64_x86.c you might have:
bool CheckParity(size_t result) {
    return __builtin_parityl(result);
}
To compile this, your command would be: gcc -c parity64_x86.c -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
On a side-note, if you'd like to inspect the assembly gcc would produce from this: gcc -S parity64_x86.c
Comments in your assembly indicate that the equivalent function prototype in C would be bool CheckParity(size_t Result), so with that in mind, here's what bindot.c might look like:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
extern bool CheckParity(size_t Result);
uint64_t bindot(uint64_t *a, uint64_t *b, size_t entries){
    uint64_t parity = 0;
    for(size_t i = 0; i < entries; ++i)
        parity ^= a[i] & b[i]; // Running sum!
    return CheckParity(parity);
}
You can build this and link it to any of the above parity64.o versions like so: gcc bindot.c parity64.o -o bindot...
I highly recommend reading the manual for your compiler, when you have the time...

SIMD (AVX2) mask store and pack [duplicate]

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
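To pin down what is being asked, here is a scalar baseline of the operation. The function name and the "keep non-negative values" condition are placeholders chosen for this sketch; any per-element predicate would do.

```c
#include <stddef.h>

/* Copies the elements of src that pass the condition (here: x >= 0) to the
   front of dst, preserving their order, and returns how many were written. */
size_t filter_non_negative_scalar(float *dst, const float *src, size_t n)
{
    size_t written = 0;
    for (size_t i = 0; i < n; ++i) {
        if (src[i] >= 0.0f)
            dst[written++] = src[i];
    }
    return written;
}
```

A vectorized version needs to produce the same output, so a loop like this also doubles as a correctness reference.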
I've seen in SSE where it was done like this:
__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
    // Move 4 sign bits of mask to 4-bit integer value.
    // (Named mask_bits so that it does not redeclare the mask parameter.)
    int mask_bits = _mm_movemask_ps(mask);
    // Select shuffle control data (shufmasks is a 16-entry table of pshufb controls)
    __m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask_bits]);
    // Permute to move valid values to front of SIMD register
    __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
    return packed;
}
This seems fine for SSE which is 4 wide, and thus only needs a 16 entry LUT, but for AVX which is 8 wide, the LUT becomes quite large(256 entries, each 32 bytes, or 8k).
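For reference, the 16-entry table for the SSE version could be generated along these lines. This is a sketch under assumptions: the function name and the plain byte-array layout are mine, and the table would need 16-byte alignment (or an unaligned load) before being used as the shufmasks array above.

```c
#include <stdint.h>

/* Fills a 16-entry table of pshufb controls for left-packing four 32-bit
   lanes. Entry [mask] moves the lanes whose mask bit is set to the front. */
void build_leftpack_shufmasks(uint8_t table[16][16])
{
    for (unsigned mask = 0; mask < 16; ++mask) {
        unsigned out = 0;                    /* next output byte position */
        for (unsigned lane = 0; lane < 4; ++lane) {
            if (mask & (1u << lane)) {
                for (unsigned byte = 0; byte < 4; ++byte)
                    table[mask][out++] = (uint8_t)(4 * lane + byte);
            }
        }
        while (out < 16)
            table[mask][out++] = 0x80;       /* high bit set: pshufb writes 0 */
    }
}
```

The same construction with eight lanes is what produces the 256-entry, 8 KiB table for AVX mentioned above.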
I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.
I think with some bit shuffling to count the # of sign bits set to the left you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But this is also quite a few instructions I think..
Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?
Here is an illustration of the Left Packing Problem from the above document:
AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)
We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.
We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.
Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).
AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)
For integer vectors with 32-bit or wider elements: Either 1) _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
For 16-bit elements, you might be able to adapt this with 128-bit vectors.
For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.
The algorithm:
Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.
Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)
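To make the pext step tangible without BMI2 hardware, here is a scalar model of the 3-bit version described in the steps above. All three function names are invented for this sketch, and pext_model is a slow bit-by-bit stand-in for the real instruction (exposed in C as _pext_u64), meant for checking the idea rather than for speed.

```c
#include <stdint.h>

/* Software stand-in for BMI2 pext: gathers the bits of src selected by
   mask into the low bits of the result, preserving their order. */
uint64_t pext_model(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    int k = 0;                               /* next output bit position */
    for (int i = 0; i < 64; ++i) {
        if ((mask >> i) & 1) {
            out |= ((src >> i) & 1) << k;
            ++k;
        }
    }
    return out;
}

/* Widens each bit of an 8-bit lane mask into a 3-bit group of ones. */
uint32_t widen_mask_3bit(unsigned lane_mask)
{
    uint32_t ctrl = 0;
    for (unsigned lane = 0; lane < 8; ++lane) {
        if (lane_mask & (1u << lane))
            ctrl |= 7u << (3 * lane);
    }
    return ctrl;
}

/* Packed 3-bit indices of the selected lanes, lowest lane first. */
uint32_t packed_indices_3bit(unsigned lane_mask)
{
    const uint32_t identity = 076543210;     /* octal: one digit per index */
    return (uint32_t)pext_model(identity, widen_mask_3bit(lane_mask));
}
```

Selecting lanes 0 and 2 (lane_mask 0b101) gives the control mask 0b111000111 and the packed result 0b010000, matching the worked example in the steps above.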
We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.
See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.
#include <stdint.h>
#include <immintrin.h>
// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101); // unpack each bit to a byte
    expanded_mask *= 0xFF; // mask |= mask<<1 | mask<<2 | ... | mask<<7;
    // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte
    const uint64_t identity_indices = 0x0706050403020100; // the identity shuffle for vpermps, packed to one index per byte
    uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);
    __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
    __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);
    return _mm256_permutevar8x32_ps(src, shufmask);
}
This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).
# clang 3.7.1 -std=gnu++14 -O3 -march=haswell
    mov eax, edi # just to zero extend: goes away when inlining
    movabs rcx, 72340172838076673 # The constants are hoisted after inlining into a loop
    pdep rax, rax, rcx # ABC -> 0000000A0000000B....
    imul rax, rax, 255 # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
    movabs rcx, 506097522914230528
    pext rax, rcx, rax
    vmovq xmm1, rax
    vpmovzxbd ymm1, xmm1 # 3c latency since this is lane-crossing
    vpermps ymm0, ymm1, ymm0
    ret
(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)
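To check what the pdep / multiply / pext sequence computes without BMI2 hardware, here is a minimal portable sketch. The helpers pdep64_soft, pext64_soft and compress_indices_soft are hypothetical names for this illustration (bit-by-bit loops standing in for the real instructions), not part of the answer's code.

```c
#include <assert.h>
#include <stdint.h>

/* Software stand-in for BMI2 pdep: deposit the low bits of src into the
   positions of the set bits of mask, lowest set bit first. */
static uint64_t pdep64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that bit */
    }
    return result;
}

/* Software stand-in for BMI2 pext: gather the bits of src selected by mask
   into the low bits of the result. */
static uint64_t pext64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & lowest)
            result |= bit;
        mask &= mask - 1;
    }
    return result;
}

/* Same three steps as compress256, up to (but not including) the vector part:
   returns the selected lane indices, one per byte, packed from the low end. */
uint64_t compress_indices_soft(unsigned mask)
{
    assert(mask <= 0xFFu);                     /* 8 lanes, so an 8-bit movmskps result */
    uint64_t expanded = pdep64_soft(mask, 0x0101010101010101ULL);
    expanded *= 0xFF;                          /* replicate each bit across its byte */
    return pext64_soft(0x0706050403020100ULL, expanded);
}
```

For a mask with bits 1, 2, 5 and 7 set (0xA6), this yields the bytes 01 02 05 07 from the low end, which is what vpmovzxbd then widens into the vpermps shuffle control.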
So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.
This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.
e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).
gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
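The shift-and-subtract form is just x*255 == (x<<8) - x, evaluated modulo 2^64. A tiny sketch to confirm the two forms agree (function names are made up for this illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 0xFF the way the clang 3.7 output does (one imul). */
uint64_t mul255_imul(uint64_t x)
{
    return x * 0xFF;
}

/* The gcc-style replacement: shift left by 8, then subtract the original.
   Unsigned arithmetic wraps, so this matches the multiply for every input. */
uint64_t mul255_shift_sub(uint64_t x)
{
    return (x << 8) - x;
}

/* Convenience check for a single value. */
void check_mul255(uint64_t x)
{
    assert(mul255_imul(x) == mul255_shift_sub(x));
}
```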
Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.
If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.
You can unpack LUT entries with the same integer sequence I used, but @Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).
See my other answer for AVX2+BMI2 with no LUT.
Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:
VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double, and 32 or 64bit integer elements (vpcompressq), but not byte or word (16bit)). It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.
The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".
To figure out how far to advance your pointer for the next vector, popcnt the mask.
Let's say you want to filter out everything but values >= 0 from an array:
#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
const float *endp = src+len;
float *dst_start = dst;
do {
__m512 sv = _mm512_loadu_ps(src);
__mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ); // true for src >= 0.0, false for unordered and src < 0.0
_mm512_mask_compressstoreu_ps(dst, keep, sv); // clang is missing this intrinsic, which can't be emulated with a separate store
src += 16;
dst += _mm_popcnt_u64(keep); // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
} while (src < endp);
return dst - dst_start;
}
This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):
# Output from gcc6.1, with -O3 -march=haswell -mavx512f. Same with other gcc versions
lea rcx, [rsi+rdx*4] # endp
mov rax, rdi
vpxord zmm1, zmm1, zmm1 # vpxor xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
vmovups zmm0, ZMMWORD PTR [rsi]
add rsi, 64
vcmpps k1, zmm0, zmm1, 29 # AVX512 compares have mask regs as a destination
kmovw edx, k1 # There are some insns to add/or/and mask regs, but not popcnt
movzx edx, dx # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
vcompressps ZMMWORD PTR [rax]{k1}, zmm0
popcnt rdx, rdx
## movsx rdx, edx # with _popcnt_u32, gcc is dumb. No casting can get gcc to do anything but sign-extend. You'd expect (unsigned) would mov to zero-extend, but no.
lea rax, [rax+rdx*4] # dst += ...
cmp rcx, rsi
ja .L2
sub rax, rdi
sar rax, 2 # address math -> element count
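For testing a vectorized filter like this, a scalar reference is handy. This is only a sketch (the name filter_non_negative_scalar is made up here); unlike the AVX512 loop above, which always consumes 16 floats per iteration, it handles any length.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: copy every element that is >= 0 to dst, preserving order,
   and return how many were kept. */
size_t filter_non_negative_scalar(float *dst, const float *src, size_t len)
{
    assert(len == 0 || (dst != NULL && src != NULL));
    size_t kept = 0;
    for (size_t i = 0; i < len; ++i) {
        if (src[i] >= 0.0f)          /* false for NaN, like _CMP_GE_OQ */
            dst[kept++] = src[i];
    }
    return kept;
}
```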
Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake
In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).
@ZachB reports in comments that, in practice, a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)
I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.
Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.
It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.
A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.
This is likely slower than vcompressps with unaligned stores on current Intel CPUs.
If you are targeting AMD Zen this method may be preferred, due to the very slow pdep and pext on Ryzen (18 cycles each).
I came up with this method, which uses a compressed LUT, which is 768(+1 padding) bytes, instead of 8k. It requires a broadcast of a single scalar value, which is then shifted by a different amount in each lane, then masked to the lower 3 bits, which provides a 0-7 LUT.
Here is the intrinsics version, along with code to build LUT.
//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
__m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr));//lower 24 bits has our LUT
// __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));
//now shift it right to get 3 bits at bottom
//__m256i shufmask = _mm256_srli_epi32(m, 29);
//Simplified version suggested by wim
//shift each lane so desired 3 bits are a bottom
//There is leftover data in the lane, but _mm256_permutevar8x32_ps only examines the first 3 bits so this is ok
__m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
return shufmask;
}
u32 get_nth_bits(int a) {
    u32 out = 0;
    int c = 0;
    for (int i = 0; i < 8; ++i) {
        auto set = (a >> i) & 1;
        if (set) {
            out |= (i << (c * 3));
            c++; // advance to the next 3-bit slot
        }
    }
    return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];
void BuildPackMask() {
    for (int i = 0; i < 256; ++i) {
        *reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
    }
}
Here is the assembly generated by MSVC:
lea ecx, DWORD PTR [rcx+rcx*2]
lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
vpbroadcastd ymm0, DWORD PTR [rcx+rax]
vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm#00000015000000120000000f0000000c00000009000000060000000300000000
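To make the table layout concrete, here is a scalar sketch of the same idea: 24 bits per mask, unpacked with a per-lane shift and a 3-bit mask. The names (pack_left_table, packed_indices, build_pack_left_table, lane_index) are hypothetical, and it writes exactly 3 bytes per entry instead of a 4-byte store, so this version doesn't need the padding byte.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t pack_left_table[256 * 3];   /* 3 bytes (8 x 3-bit indices) per mask */

/* Pack the positions of the set bits of mask into consecutive 3-bit fields. */
static uint32_t packed_indices(unsigned mask)
{
    uint32_t out = 0;
    int count = 0;
    for (unsigned i = 0; i < 8; ++i) {
        if ((mask >> i) & 1u) {
            out |= (uint32_t)i << (count * 3);
            ++count;
        }
    }
    return out;
}

void build_pack_left_table(void)
{
    for (unsigned m = 0; m < 256; ++m) {
        uint32_t v = packed_indices(m);
        pack_left_table[m * 3 + 0] = (uint8_t)(v & 0xFF);
        pack_left_table[m * 3 + 1] = (uint8_t)((v >> 8) & 0xFF);
        pack_left_table[m * 3 + 2] = (uint8_t)((v >> 16) & 0xFF);
    }
}

/* Scalar equivalent of one lane of the broadcast + variable right shift:
   lane k shifts the 24-bit entry right by 3*k and keeps the low 3 bits. */
unsigned lane_index(unsigned mask, unsigned lane)
{
    assert(mask < 256 && lane < 8);
    const uint8_t *p = &pack_left_table[mask * 3];
    uint32_t entry = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (entry >> (lane * 3)) & 7u;
}
```

Call build_pack_left_table() once before any lookup. For mask 0xA6 the first four lanes come out as 1, 2, 5, 7.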
I'll add more information to a great answer from @PeterCordes: https://stackoverflow.com/a/36951611/5021064.
I did the implementations of std::remove from C++ standard for integer types with it. The algorithm, once you can do compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then benchmarks.
I ended up with two meaningful variations on the proposed solution:
__m128i registers, any element type, using _mm_shuffle_epi8 instruction
__m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32
When the types are smaller than 4 bytes for a 256 bit register, I split them in two 128 bit registers and compress/store each one separately.
Link to compiler explorer where you can see complete assembly (there is a using type and width (in elements per pack) in the bottom, which you can plug in to get different variations) : https://gcc.godbolt.org/z/yQFR2t
NOTE: my code is in C++17 and is using a custom simd wrappers, so I do not know how readable it is. If you want to read my code -> most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.
Implementations of @PeterCordes's answer for both cases
Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.
Mask for _mm_shuffle_epi8
Write an index for each byte into a half byte: 0xfedcba9876543210
Get pairs of indexes into 8 shorts packed into __m128i
Spread them out using x << 4 | x & 0x0f0f
Example of spreading the indexes. Let's say 7th and 6th elements are picked.
It means that the corresponding short would be: 0x00fe. After << 4 and | we'd get 0x0ffe. And then we clear out the second f.
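The spreading step on a single 16-bit lane, as a scalar sketch (spread_nibbles is a made-up name for this illustration):

```c
#include <assert.h>

/* Take one byte holding two 4-bit indexes and spread them so that each index
   ends up in the low nibble of its own byte: (x << 4 | x) & 0x0f0f. */
unsigned spread_nibbles(unsigned pair)
{
    assert(pair <= 0xFFu);           /* one byte: two packed 4-bit indexes */
    return ((pair << 4) | pair) & 0x0F0Fu;
}
```

With the 0x00fe example above, spread_nibbles(0xFE) gives 0x0f0e: index 0xe in the low byte and 0xf in the high byte, which is the form _mm_shuffle_epi8 wants.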
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of `_mm_movemask_epi8`,
// `uint16_t` - there are at most 16 bits with values for __m128i.
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;
const std::uint8_t offset =
static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes =
_pext_u64(0xfedcba9876543210, mmask_expanded); // Do the @PeterCordes answer
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte); // From bytes to shorts over the whole register
const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4); // x << 4
const __m128i combined = _mm_or_si128(shift_by_4, as_16bit); // | x
const __m128i filter = _mm_set1_epi16(0x0f0f); // 0x0f0f
const __m128i res = _mm_and_si128(combined, filter); // & 0x0f0f
return {res, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
auto res = _compress_mask::mask128(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Mask for _mm256_permutevar8x32_epi32
This is almost one for one @PeterCordes's solution - the only difference is the _pdep_u64 bit (he suggests this as a note).
The mask that I chose is 0x5555'5555'5555'5555. The idea is - I have 32 bits of mmask, 4 bits for each of 8 integers. I have 64 bits that I want to get => I need to convert each bit of 32 bits into 2 => therefore 0101b = 5. The multiplier also changes from 0xff to 3 because I will get 0x55 for each integer, not 1.
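A portable sketch of that expansion, for checking the arithmetic without BMI2. Note the swap: instead of _pdep_u64 it uses the classic shift-and-mask bit interleave, which gives the same result as pdep with the 0x5555... mask. The name expand_movemask_epi8 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Turn a 32-bit byte mask (4 bits per 32-bit element) into a 64-bit mask with
   8 bits per element, i.e. every input bit doubled. */
uint64_t expand_movemask_epi8(uint32_t mmask)
{
    /* A movemask_epi8 of a 32-bit-element compare has uniform nibbles. */
    for (int lane = 0; lane < 8; ++lane) {
        unsigned nibble = (mmask >> (lane * 4)) & 0xFu;
        assert(nibble == 0x0u || nibble == 0xFu);
    }

    /* Move bit i to bit 2*i (same effect as pdep with 0x5555...5555). */
    uint64_t x = mmask;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;

    return x * 3;   /* 01b -> 11b: fill the gap next to each bit */
}
```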
Complete mask code:
// helper namespace
namespace _compress_mask {
// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;
const std::uint8_t offset = static_cast<std::uint8_t>(_mm_popcnt_u32(mmask)); // To compute how many elements were selected
const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded); // Do the @PeterCordes answer
// Every index was one byte => we need to make them into 4 bytes
const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0000|compressed indexes
const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte); // spread them out
return {expanded, offset};
}
} // namespace _compress_mask
template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
static_assert(sizeof(T) >= 4); // You cannot permute shorts/chars with this.
auto res = _compress_mask::mask256_epi32(mmask);
res.second /= sizeof(T); // bit count to element count
return res;
}
Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, build from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark
Controlling for code alignment:
If you are not familiar with the concept, read this or watch this
All functions in the benchmark's binary are aligned to 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide in the beginning of the function (before entering the loop). The main numbers I show is min per each measurement. I think this works since the algorithm is inlined. I'm also validated by the fact that I get very different results. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline
Benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because SIMD performance depends on the size of the data and not the number of elements. The element count can be derived from an element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since the time it takes for non simd code depends mostly on the element count, the wins should be bigger for chars.
Plots: x - percentage of zeroes, y - time in nanoseconds. padding : min indicates that this is minimum among all alignments.
40 bytes worth of data, 40 chars
For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over non-simd code. So, for example, compiler should be careful doing this.
1000 bytes worth of data, 1000 chars
Apparently the non-simd version is dominated by branch prediction: when we get a small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about 5-6 times speed up. When the branch predictor can't help the non-simd version, there is about a 27 times speed up. It's an interesting property of simd code that its performance tends to be much less dependent on the data. Using 128 vs 256 register shows practically no difference, since most of the work is still split into 2 128 registers.
1000 bytes worth of data, 500 shorts
Similar results for shorts except with a much smaller gain - up to 2 times.
I don't know why shorts do that much better than chars for non-simd code: I'd expect shorts to be two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.
1000 bytes worth of data, 250 ints
For 1000 bytes only the 256 bit version makes sense - a 20-30% win, excluding the case with no 0s to remove whatsoever (perfect branch prediction, no removing for non-simd code).
10'000 bytes worth of data, 10'000 chars
The same order of magnitude wins as for 1000 chars: from 2-6 times faster when the branch predictor is helpful to 27 times when it's not.
Same plots, only simd versions:
Here we can see about a 10% win from using 256 bit registers and splitting them in 2 128 bit ones. In size it grows from 88 to 129 instructions, which is not a lot, so it might make sense depending on your use-case. For base-line - the non-simd version is 79 instructions (as far as I know - these are smaller than SIMD ones though).
10'000 bytes worth of data, 5'000 shorts
From 20% to 9 times win, depending on the data distributions. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars and the same win for 256 bit one of about 10%.
10'000 bytes worth of data, 2'500 ints
Seems to make a lot of sense to use 256 bit registers, this version is about 2 times faster compared to 128 bit registers. When comparing with non-simd code - from a 20% win with a perfect branch prediction to 3.5 - 4 times as soon as it's not.
Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512
On percentage of elements to remove
On one hand it's uncommon to filter half of your elements. On the other hand a similar algorithm can be used in partition during sorting => that is actually expected to have ~50% branch selection.
Code alignment impact
The question is: how much worse does it get if the code happens to be poorly aligned
(generally speaking - there is very little one can do about it).
I'm only showing for 10'000 bytes.
The plots have two lines for min and for max for each percentage point (meaning - it's not one best/worst code alignment - it's the best code alignment for a given percentage).
Code alignment impact - non-simd
From 15-20% for poor branch prediction to 2-3 times when branch prediction helped a lot. (branch predictor is known to be affected by code alignment).
For some reason - the 0 percent is not affected at all. It can be explained by std::remove first doing linear search to find the first element to remove. Apparently linear search for shorts is not affected.
Other than that - from 10% to 1.6-1.8 times worse
Same as for shorts - no 0s is not affected. As soon as we go into the remove part it goes from 1.3 times to 5 times worse than the best case alignment.
Code alignment impact - simd versions
Not showing shorts and ints 128, since it's almost the same assembly as for chars
Chars - 128 bit register
About 1.2 times slower
Chars - 256 bit register
About 1.1 - 1.24 times slower
Ints - 256 bit register
1.25 - 1.35 times slower
We can see that for simd version of the algorithm, code alignment has significantly less impact compared to non-simd version. I suspect that this is due to practically not having branches.
In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.
Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
switch(mask) {
case 0:
case 1: return val;
case 2: return _mm_shuffle_ps(val,val,0x01);
case 3: return val;
case 4: return _mm_shuffle_ps(val,val,0x02);
case 5: return _mm_shuffle_ps(val,val,0x08);
case 6: return _mm_shuffle_ps(val,val,0x09);
case 7: return val;
case 8: return _mm_shuffle_ps(val,val,0x03);
case 9: return _mm_shuffle_ps(val,val,0x0c);
case 10: return _mm_shuffle_ps(val,val,0x0d);
case 11: return _mm_shuffle_ps(val,val,0x34);
case 12: return _mm_shuffle_ps(val,val,0x0e);
case 13: return _mm_shuffle_ps(val,val,0x38);
case 14: return _mm_shuffle_ps(val,val,0x39);
case 15: return val;
}
}
__m128 foo(__m128 val, __m128 maskv) {
int mask = _mm_movemask_ps(maskv);
return LeftPack_SSE2(val, mask);
}
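If the shuffle immediates in that table look opaque: with both operands the same, _mm_shuffle_ps(val,val,imm) picks source element (imm >> 2*i) & 3 for output lane i. A scalar sketch for decoding or checking a table entry (shuffle_ps_scalar is a made-up helper, and in and out must not alias):

```c
#include <assert.h>

/* Scalar model of _mm_shuffle_ps(v, v, imm): two selector bits per output lane. */
void shuffle_ps_scalar(const float in[4], unsigned imm, float out[4])
{
    assert(imm <= 0xFFu);            /* the immediate is 8 bits */
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = in[(imm >> (2 * lane)) & 3u];
}
```

For example case 10 (elements 1 and 3 kept) uses 0x0d, which moves elements 1 and 3 into lanes 0 and 1; the upper lanes hold leftovers that the caller ignores.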
This is perhaps a bit late though I recently ran into this exact problem and found an alternative solution which used a strictly AVX implementation. If you don't care if unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:
inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
    const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
    const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
    const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);
    __m128 v = val;
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask0);
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask1);
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, shiftMask2);
    return v;
}
Essentially, each element in val is shifted once to the left using the bitfield, 0xF9 for blending with its unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration and this should provide an AVX version of the _pdep_u32() BMI2 instruction.
If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.
And if you're using double-precision, here's an additional version for AVX2:
inline __m256d left_pack(__m256d val, __m256i mask) noexcept
{
    const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
    const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
    const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);
    __m256d v = val;
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask0);
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask1);
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, shiftMask2);
    return v;
}
Additionally _mm_popcnt_u32(_mm_movemask_ps(val)) can be used to determine the number of elements which remained after the left-packing.

My attempt to optimize memset on a 64bit machine takes more time than standard implementation. Can someone please explain why?

(machine is x86 64 bit running SL6)
I was trying to see if I can optimize memset on my 64 bit machine. As per my understanding memset goes byte by byte and sets the value. I assumed that if I do in units of 64 bits, it would be faster. But somehow it takes more time. Can someone take a look at my code and suggest why ?
/* Code */
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
void memset8(unsigned char *dest, unsigned char val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}
void memset32(uint32_t *dest, uint32_t val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}
void memset64(uint64_t *dest, uint64_t val, uint32_t count)
{
    while (count--)
        *dest++ = val;
}
#define CYCLES 1000000000
int main()
{
    clock_t start, end;
    double total;
    uint64_t loop;
    uint64_t val;
    /* memset 32 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        memset32((uint32_t*)&val, 0, 2);
    }
    end = clock();
    total = (double)(end-start)/CLOCKS_PER_SEC;
    printf("Timetaken memset32 %g\n", total);
    /* memset 64 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        memset64(&val, 0, 1);
    }
    end = clock();
    total = (double)(end-start)/CLOCKS_PER_SEC;
    printf("Timetaken memset64 %g\n", total);
    /* memset 8 */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        memset8((unsigned char*)&val, 0, 8);
    }
    end = clock();
    total = (double)(end-start)/CLOCKS_PER_SEC;
    printf("Timetaken memset8 %g\n", total);
    /* memset */
    start = clock();
    for (loop = 0; loop < CYCLES; loop++) {
        memset(&val, 0, 8);
    }
    end = clock();
    total = (double)(end-start)/CLOCKS_PER_SEC;
    printf("Timetaken memset %g\n", total);
}
Timetaken memset32 12.46
Timetaken memset64 7.57
Timetaken memset8 37.12
Timetaken memset 6.03
Looks like the standard memset is more optimized than my implementation.
I tried looking into code and everywhere I see that the implementation of memset is the same as what I did for memset8. When I use memset8, the results are more like what I expect and very different from memset.
Can someone suggest what am I doing wrong ?
Actual memset implementations are typically hand-optimized in assembly, and use the widest aligned writes available on the targeted hardware. On x86_64 that will be at least 16B stores (movaps, for example). It may also take advantage of prefetching (this is less common recently, as most architectures have good automatic streaming prefetchers for regular access patterns), streaming stores or dedicated instructions (historically rep stos was unusably slow on x86, but it is quite fast on recent microarchitectures). Your implementation does none of these things. It should not be terribly surprising that the system implementation is faster.
As an example, consider the implementation used in OS X 10.8 (which has been superseded in 10.9). Here’s the core loop for modest-sized buffers:
.align 4,0x90
1: movdqa %xmm0, (%rdi,%rcx)
movdqa %xmm0, 16(%rdi,%rcx)
movdqa %xmm0, 32(%rdi,%rcx)
movdqa %xmm0, 48(%rdi,%rcx)
addq $64, %rcx
jne 1b
This loop will saturate the LSU when hitting cache on pre-Haswell microarchitectures at 16B/cycle. An implementation based on 64-bit stores like your memset64 cannot exceed 8B/cycle (and may not even achieve that, depending on the microarchitecture in question and whether or not the compiler unrolls your loop). On Haswell, an implementation that uses AVX stores or rep stos can go even faster and achieve 32B/cycle.
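For comparison, a plain-C version with 64-bit stores that also copes with arbitrary byte counts might look like the sketch below. memset_wide is a hypothetical name, this is nowhere near a tuned libc routine, and it is still limited to 8 bytes per store as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill count bytes at dest with val, 8 bytes at a time where possible. */
void *memset_wide(void *dest, unsigned char val, size_t count)
{
    assert(dest != NULL || count == 0);
    unsigned char *p = dest;
    /* Broadcast the byte into all 8 bytes of a 64-bit pattern. */
    const uint64_t pattern = (uint64_t)val * 0x0101010101010101ULL;

    while (count >= sizeof pattern) {
        /* Fixed-size memcpy: typically lowered to a single unaligned 8-byte store. */
        memcpy(p, &pattern, sizeof pattern);
        p += sizeof pattern;
        count -= sizeof pattern;
    }
    while (count--)                  /* 0 to 7 leftover bytes */
        *p++ = val;
    return dest;
}
```

Be aware that an optimizing compiler may recognise a fill loop like this and replace it with a call to the library memset anyway, so check the generated assembly before benchmarking.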
As per my understanding memset goes byte by byte and sets the value.
The details of what the memset facility does are implementation dependent. Relying on this is usually a good thing, because I'm sure the implementors have extensive knowledge of the system and know all kinds of techniques to make things as fast as possible.
To elaborate a little more, let's look at:
memset(&val, 0, 8);
When the compiler sees this it can notice a few things like:
The fill value is 0
The number of bytes to fill is 8
and then choose the right instructions to use depending on where val or &val is (in a register, in memory, ...). But if memset is stuck needing to be a function call (like your implementations), none of those optimizations are possible. Even if it can't make compile time decisions like:
memset(&val, x, y); // no way to tell at compile time what x and y will be...
you can be assured that there's a function call written in assembler that will be as fast as possible for your platform.
I think it's worth exploring how to write a faster memset particularly with GCC (which I assume you are using with Scientific Linux 6) in C/C++. Many people assume the standard implementation is optimized. This is not necessarily true. If you see table 2.1 of Agner Fog's Optimizing Software in C++ manuals he compares memcpy for several different compilers and platforms to his own assembly optimized version of memcpy. Memcpy in GCC at the time really underperformed (but the Mac version was good). He claims the built in functions are even worse and recommends using -fno-builtin. GCC in my experience is very good at optimizing code but its library functions (and built in functions) are not very optimized (with ICC it's the other way around).
It would be interesting to see how well you could do using intrinsics. If you look at his asmlib you can see how he implements memset with SSE and AVX (it would be interesting to compare this to Apple's optimized version Stephen Canon posted).
With AVX you can see he writes 32 bytes at a time.
K100: ; Loop through 32-bytes blocks. Register use is swapped
; Rcount = end of 32-bytes blocks part
; Rdest = negative index from the end, counting up to zero
vmovaps [Rcount+Rdest], ymm0
add Rdest, 20H
jnz K100
vmovaps in this case is the same as the intrinsic _mm256_store_ps. Maybe GCC has improved since then but you might be able to beat GCC's implementation of memset using intrinsics. If you don't have AVX you certainly have SSE (all x86 64bit do) so you could look at the SSE version of his code to see what you could do.
Here is a start for your memset32 function assuming the array fits in the L1 cache. If the array does not fit in the cache you want to do a non temporal store with _mm256_stream_ps. For a general function you need several cases also including cases when the memory is not 32 byte aligned.
#include <immintrin.h>
int main() {
    int count = (1<<14)/sizeof(int);
    int val = 1; // fill value; not defined in the original snippet, picked here for illustration
    int* dest = (int*)_mm_malloc(sizeof(int)*count, 32); // 32 byte aligned
    __m256 val8 = _mm256_castsi256_ps(_mm256_set1_epi32(val));
    for(int i=0; i<count; i+=8) {
        _mm256_store_ps((float*)(dest+i), val8);
    }
    _mm_free(dest);
}

Calculating CPU frequency in C with RDTSC always returns 0

The following piece of code was given to us from our instructor so we could measure some algorithms performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
    asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
        : "=r" (*hi), "=r" (*lo)
        : /* No input */
        : "%edx", "%eax");
}
void start_counter() {
    access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
    unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
    double result;
    access_counter(&ncyc_hi, &ncyc_lo);
    lo = ncyc_lo - cyc_lo;
    borrow = lo > ncyc_lo;
    hi = ncyc_hi - cyc_hi - borrow;
    result = (double) hi * (1 << 30) * 4 + lo;
    return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
    double c1, c2;
    c1 = get_counter();
    sleep(1);
    c2 = get_counter();
    printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
    printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
    return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it is working to some extent; I mean, the result is bigger than 0 but it's variable because the CPU frequency is not fixed (we tried to fix it but for some reason we are not able to do it). He has a different machine which is running Linux (Ubuntu) as host and it also reports 0. This rules out the problem being on the virtual machine, which I thought it was the issue at first.
Any ideas why this is happening and how can I fix it?
Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do the 64-bit math carries yourself, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm; you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
#include <stdint.h>

uint64_t rdtsc() {
    uint64_t ret;
# if __WORDSIZE == 64
    asm ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax;"
        : "=A"(ret)
        : /* no input */
        : "%edx"
    );
#else
    asm ("rdtsc"
        : "=A"(ret)
    );
#endif
    return ret;
}
Also you should consider printing out the values you're getting out of this so you can see if you're getting out 0s, or something else.
As for VMware, take a look at the time keeping spec (PDF Link), as well as this thread. TSC instructions are (depending on the guest OS):
1. Passed directly to the real hardware (PV guest)
2. Count cycles while the VM is executing on the host processor (Windows / etc)
Note the "while the VM is executing on the host processor" in #2. The same phenomenon would go for Xen as well, if I recall correctly. In essence, you can expect that the code should work as expected on a paravirtualized guest. If emulated, it's entirely unreasonable to expect hardware-like consistency.
You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and @Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.
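To tie this back to the original goal of estimating the CPU frequency, here is a minimal sketch that is not taken from any of the answers. It counts TSC ticks across a busy-wait that is timed with the standard clock() function. The helper names read_ticks and estimate_ticks_per_second are made up for illustration, <x86intrin.h> is assumed to be available (GCC/Clang), and on non-x86 targets the sketch simply falls back to clock() itself. It does not serialize the instruction stream, so it is only meaningful over intervals far longer than the reordering window. On many modern CPUs the TSC runs at a constant reference rate, so the result may differ from the current core clock.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* assumption: GCC/Clang header declaring __rdtsc() */
uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumption: no TSC on this target, so fall back to the C clock. */
uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

/* Hypothetical helper: count ticks while busy-waiting for roughly
   `seconds` of CPU time, then scale the count to ticks per second. */
double estimate_ticks_per_second(double seconds)
{
    assert(seconds > 0.0);

    clock_t c0 = clock();
    uint64_t t0 = read_ticks();
    clock_t c1;
    do {
        c1 = clock();                       /* busy-wait so CPU time advances */
    } while ((double)(c1 - c0) / CLOCKS_PER_SEC < seconds);
    uint64_t t1 = read_ticks();

    double elapsed = (double)(c1 - c0) / CLOCKS_PER_SEC;
    return (double)(t1 - t0) / elapsed;
}
```

Calling estimate_ticks_per_second(0.05) and dividing by 1E6 gives a rough MHz figure. Printing the raw tick counts as well makes it easier to spot a counter that is stuck at 0.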
Hmm, I'm not positive, but I suspect the problem may be in this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm not sure you can safely carry out such large multiplications in an "unsigned", which is often a 32-bit type. The fact that you couldn't multiply by 2^32 directly and had to write it as 2^30 with an extra "* 4" already hints at this possibility. You might need to convert hi and lo to double individually (instead of converting once at the very end) and do the multiplication on the two doubles.
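One way to sidestep the question entirely is to keep everything in 64-bit integers. The sketch below is an illustration with made-up function names (elapsed_manual, elapsed_u64), not code from the thread. It places the manual hi/lo borrow arithmetic from the question next to a plain uint64_t subtraction and checks that both give the same tick count.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the question's approach: subtract the halves with an explicit borrow. */
uint64_t elapsed_manual(uint32_t hi0, uint32_t lo0, uint32_t hi1, uint32_t lo1)
{
    uint32_t lo = lo1 - lo0;            /* wraps modulo 2^32 */
    uint32_t borrow = lo > lo1;         /* 1 if the low subtraction wrapped */
    uint32_t hi = hi1 - hi0 - borrow;
    return ((uint64_t)hi << 32) | lo;
}

/* Same result with no manual carry handling and no floating point. */
uint64_t elapsed_u64(uint32_t hi0, uint32_t lo0, uint32_t hi1, uint32_t lo1)
{
    uint64_t start = ((uint64_t)hi0 << 32) | lo0;
    uint64_t end   = ((uint64_t)hi1 << 32) | lo1;
    /* sanity check: both methods agree */
    assert(end - start == elapsed_manual(hi0, lo0, hi1, lo1));
    return end - start;
}
```

For example, going from hi=0, lo=0xFFFFFFFF to hi=1, lo=5 is an elapsed count of 6 ticks with either function.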

What is the fastest way to convert float to int on x86

What is the fastest way you know to convert a floating-point number to an int on an x86 CPU? Preferably in C or assembly (that can be in-lined in C) for any combination of the following:
32/64/80-bit float -> 32/64-bit integer
I'm looking for some technique that is faster than to just let the compiler do it.
It depends on if you want a truncating conversion or a rounding one and at what precision. By default, C will perform a truncating conversion when you go from float to int. There are FPU instructions that do it but it's not an ANSI C conversion and there are significant caveats to using it (such as knowing the FPU rounding state). Since the answer to your problem is quite complex and depends on some variables you haven't expressed, I recommend this article on the issue:
Packed conversion using SSE is by far the fastest method, since you can convert multiple values in the same instruction. ffmpeg has a lot of assembly for this (mostly for converting the decoded output of audio to integer samples); check it for some examples.
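As a rough illustration of the packed approach (this is not ffmpeg's code), the sketch below converts four floats per iteration with the SSE2 intrinsic _mm_cvttps_epi32, which truncates like a C cast. The function name floats_to_ints is made up for the example, and a scalar loop handles the tail as well as targets built without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: truncating float -> int32 conversion of n values. */
void floats_to_ints(int32_t *dst, const float *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128  v = _mm_loadu_ps(src + i);          /* unaligned load of 4 floats */
        __m128i t = _mm_cvttps_epi32(v);            /* CVTTPS2DQ: truncate all 4 */
        _mm_storeu_si128((__m128i *)(dst + i), t);  /* unaligned store of 4 ints */
    }
#endif
    for (; i < n; ++i)                              /* scalar tail / fallback */
        dst[i] = (int32_t)src[i];
}
```

The unaligned load and store keep the sketch simple. With buffers known to be 16-byte aligned, the aligned variants _mm_load_ps and _mm_store_si128 could be used instead.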
A commonly used trick for plain x86/x87 code is to force the mantissa part of the float to represent the int. 32 bit version follows.
The 64-bit version is analogous. The Lua version posted above is faster, but relies on the truncation of double to a 32-bit result, therefore it requires the x87 unit to be set to double precision, and cannot be adapted for double to 64-bit int conversion.
The nice thing about this code is it is completely portable for all platforms conforming to IEEE 754, the only assumption made is the floating point rounding mode is set to nearest. Note: Portable in the sense it compiles and works. Platforms other than x86 usually do not benefit much from this technique, if at all.
static const float Snapper=3<<22;

union UFloatInt {
    int i;
    float f;
};

/** by Vlad Kaipetsky
portable assuming FP24 set to nearest rounding mode
efficient on x86 platform
*/
inline int toInt( float fval )
{
    Assert( fabs(fval)<=0x003fffff ); // only 23 bit values handled
    UFloatInt &fi = *(UFloatInt *)&fval;
    fi.f += Snapper;
    return ( (fi.i)&0x007fffff ) - 0x00400000;
}
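For readers who want to try the trick in plain C, here is a hedged variant of the snippet above, not the author's code. It reads the bits with memcpy instead of a pointer cast, and the name float_to_int_magic is made up. It assumes IEEE 754 single precision and the default round-to-nearest mode. Adding 1.5 * 2^23 pushes the value into a range where one unit in the last place equals 1, so the low 23 bits of the result hold 2^22 plus the rounded input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round-to-nearest float -> int via the mantissa trick. */
int float_to_int_magic(float f)
{
    /* Only values that fit in 23 bits (including sign) are handled. */
    assert(f >= -4194303.0f && f <= 4194303.0f);

    /* volatile forces the sum to be rounded to a real 32-bit float. */
    volatile float shifted = f + 12582912.0f;   /* 3 << 22 == 1.5 * 2^23 */
    float tmp = shifted;

    uint32_t bits;
    memcpy(&bits, &tmp, sizeof bits);           /* aliasing-safe bit read */

    /* Low 23 bits are 2^22 + round(f); remove the 2^22 offset. */
    return (int)(bits & 0x007fffffu) - 0x00400000;
}
```

Because the rounding happens in the floating-point add, ties go to the nearest even integer: 2.5f becomes 2 and 3.5f becomes 4, unlike a C cast, which truncates.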
There is one instruction to convert a floating point to an int in assembly: use the FISTP instruction. It pops the value off the floating-point stack, converts it to an integer, and then stores it at the address specified. I don't think there would be a faster way (unless you use extended instruction sets like MMX or SSE, which I am not familiar with).
Another instruction, FIST, leaves the value on the FP stack but I'm not sure it works with quad-word sized destinations.
If you can guarantee the CPU running your code is SSE3 compatible (even Pentium 5 is, JBB), you can allow the compiler to use its FISTTP instruction (i.e. -msse3 for gcc). It seems to do the thing like it should always have been done:
Note that FISTTP is different from FISTP (that has its problems, causing the slowness). It comes as part of SSE3 but is actually (the only) X87-side refinement.
CPUs other than x86 would probably do the conversion just fine anyway. :)
Processors with SSE3 support
The Lua code base has the following snippet to do this (check in src/luaconf.h from www.lua.org).
If you find (SO finds) a faster way, I'm sure they'd be thrilled.
Oh, lua_Number means double. :)
/*
@@ lua_number2int is a macro to convert lua_Number to int.
@@ lua_number2integer is a macro to convert lua_Number to lua_Integer.
** CHANGE them if you know a faster way to convert a lua_Number to
** int (with any rounding method and without throwing errors) in your
** system. In Pentium machines, a naive typecast from double to int
** in C is extremely slow, so any alternative is worth trying.
*/

/* On a Pentium, resort to a trick */
#if defined(LUA_NUMBER_DOUBLE) && !defined(LUA_ANSI) && !defined(__SSE2__) && \
    (defined(__i386) || defined (_M_IX86) || defined(__i386__))

/* On a Microsoft compiler, use assembler */
#if defined(_MSC_VER)

#define lua_number2int(i,d) __asm fld d __asm fistp i
#define lua_number2integer(i,n) lua_number2int(i, n)

/* the next trick should work on any Pentium, but sometimes clashes
   with a DirectX idiosyncrasy */
#else

union luai_Cast { double l_d; long l_l; };
#define lua_number2int(i,d) \
  { volatile union luai_Cast u; u.l_d = (d) + 6755399441055744.0; (i) = u.l_l; }
#define lua_number2integer(i,n) lua_number2int(i, n)

#endif

/* this option always works, but may be slow */
#else
#define lua_number2int(i,d) ((i)=(int)(d))
#define lua_number2integer(i,d) ((i)=(lua_Integer)(d))

#endif
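The constant in the union trick is 1.5 * 2^52. The following sketch is an illustration rather than Lua's code: it spells the same idea out as a function, with the made-up name double_to_int32_magic, and reads the bits with memcpy instead of a union. It assumes IEEE 754 doubles and round-to-nearest mode. After the addition one unit in the last place equals 1, and because the 2^51 offset is a multiple of 2^32, the low 32 bits of the representation are the rounded input in two's complement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round-to-nearest double -> int32 via the magic constant. */
int32_t double_to_int32_magic(double d)
{
    /* The result must fit in 32 bits for the low word to be meaningful. */
    assert(d >= -2147483648.0 && d <= 2147483647.0);

    /* volatile forces the sum to be rounded to a real 64-bit double. */
    volatile double shifted = d + 6755399441055744.0;   /* 1.5 * 2^52 */
    double tmp = shifted;

    uint64_t bits;
    memcpy(&bits, &tmp, sizeof bits);                   /* aliasing-safe bit read */

    /* Keep only the low 32 bits of the mantissa. */
    return (int32_t)(uint32_t)bits;
}
```

Taking the low 32 bits of the 64-bit integer value, rather than the first four bytes in memory, keeps the sketch independent of byte order on platforms where doubles and integers share the same endianness.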
I assume truncation is required, same as if one writes i = (int)f in "C".
If you have SSE3, you can use:
int convert(float x)
{
    int n;
    __asm {
        fld x
        fisttp n // the extra 't' means truncate
    }
    return n;
}
Alternatively, with SSE2 (or on x64, where inline assembly might not be available), you can use the following, which is almost as fast:
#include <xmmintrin.h>
int convert(float x)
{
    return _mm_cvtt_ss2si(_mm_load_ss(&x)); // extra 't' means truncate
}
On older computers there is an option to set the rounding mode manually and perform conversion using the ordinary fistp instruction. That will probably only work for arrays of floats, otherwise care must be taken to not use any constructs that would make the compiler change rounding mode (such as casting). It is done like this:
void Set_Trunc()
{
    // cw is a 16-bit register [_ _ _ ic rc1 rc0 pc1 pc0 iem _ pm um om zm dm im]
    __asm {
        push ax // use stack to store the control word
        fnstcw word ptr [esp]
        fwait // needed to make sure the control word is there
        mov ax, word ptr [esp] // or pop ax ...
        or ax, 0xc00 // set both rc bits (alternately "or ah, 0xc")
        mov word ptr [esp], ax // ... and push ax
        fldcw word ptr [esp]
        pop ax
    }
}
void convertArray(int *dest, const float *src, int n)
{
    __asm {
        mov eax, src
        mov edx, dest
        mov ecx, n // load loop variables
        cmp ecx, 0
        je bottom // handle zero-length arrays
    top:
        fld dword ptr [eax]
        fistp dword ptr [edx]
        add eax, 4 // advance to the next float
        add edx, 4 // advance to the next int
        loop top // decrement ecx, jump to top
    bottom:
    }
}
Note that the inline assembly only works with Microsoft's Visual Studio compilers (and maybe Borland), it would have to be rewritten to GNU assembly in order to compile with gcc.
The SSE2 solution with intrinsics should be quite portable, however.
Other rounding modes are possible by different SSE2 intrinsics or by manually setting the FPU control word to a different rounding mode.
If you really care about the speed of this make sure your compiler is generating the FIST instruction. In MSVC you can do this with /QIfist, see this MSDN overview
You can also consider using SSE intrinsics to do the work for you, see this article from Intel: http://softwarecommunity.intel.com/articles/eng/2076.htm
Since MS screws us out of inline assembly in x64 and forces us to use intrinsics, I looked up which to use. The MSDN doc gives _mm_cvtsd_si64x with an example.
The example works, but is horribly inefficient, using an unaligned load of 2 doubles, where we need just a single load, so getting rid of the additional alignment requirement. Then a lot of needless loads and reloads are produced, but they can be eliminated as follows:
#include <intrin.h>
#pragma intrinsic(_mm_cvtsd_si64x)
long long _inline double2int(const double &d)
{
    return _mm_cvtsd_si64x(*(__m128d*)&d);
}
000000013F651085 cvtsd2si rax,mmword ptr [rsp+38h]
000000013F65108C mov qword ptr [rsp+28h],rax
The rounding mode can be set without inline assembly, e.g.
_control87(_RC_NEAR, _MCW_RC);
where rounding to nearest is the default (anyway).
The question whether to set the rounding mode at each call or to assume it will be restored (third party libs) will have to be answered by experience, I guess.
You will have to include float.h for _control87() and related constants.
And, no, this will not work in 32 bits, so keep using the FISTP instruction:
_asm fld d
_asm fistp i
Generally, you can trust the compiler to be efficient and correct. There is usually nothing to be gained by rolling your own functions for something that already exists in the compiler.
