Decimal concatenation of two integers using bit operations

Decimal concatenation of two integers using bit operations - c

I want to concatenate two integers using only bit operations since i need efficiency as much as possible.There are various answers available but they are not fast enough what I want is implementation that uses only bit operations like left shift,or and etc.
Please guide me how to do it.
example
int x=32;
int y=12;
int result=3212;
I am having and FPGA implentation of AES.I need this on my system to reduce time consumption of some task

The most efficient way to do it, is likely something similar to this:
uint32_t uintcat (uint32_t ms, uint32_t ls)
{
uint32_t mult=1;
do
{
mult *= 10;
} while(mult <= ls);
return ms * mult + ls;
}
Then let the compiler worry about optimization. There's likely not a lot it can improve since this is base 10, which doesn't get along well with the various instructions of the computer, like shifting.
EDIT : BENCHMARK TEST
Intel i7-3770 2 3,4 GHz
OS: Windows 7/64
Mingw, GCC version 4.6.2
gcc -O3 -std=c99 -pedantic-errors -Wall
10 million random values, from 0 to 3276732767.
Result (approximates):
Algorithm 1: 60287 micro seconds
Algorithm 2: 65185 micro seconds
Benchmark code used:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>
#include <time.h>
uint32_t uintcat (uint32_t ms, uint32_t ls)
{
uint32_t mult=1;
do
{
mult *= 10;
} while(mult <= ls);
return ms * mult + ls;
}
uint32_t myConcat (uint32_t a, uint32_t b) {
switch( (b >= 10000000) ? 7 :
(b >= 1000000) ? 6 :
(b >= 100000) ? 5 :
(b >= 10000) ? 4 :
(b >= 1000) ? 3 :
(b >= 100) ? 2 :
(b >= 10) ? 1 : 0 ) {
case 1: return a*100+b; break;
case 2: return a*1000+b; break;
case 3: return a*10000+b; break;
case 4: return a*100000+b; break;
case 5: return a*1000000+b; break;
case 6: return a*10000000+b; break;
case 7: return a*100000000+b; break;
default: return a*10+b; break;
}
}
static LARGE_INTEGER freq;
static void print_benchmark_results (LARGE_INTEGER* start, LARGE_INTEGER* end)
{
LARGE_INTEGER elapsed;
elapsed.QuadPart = end->QuadPart - start->QuadPart;
elapsed.QuadPart *= 1000000;
elapsed.QuadPart /= freq.QuadPart;
printf("%lu micro seconds", elapsed.QuadPart);
}
int main()
{
const uint32_t TEST_N = 10000000;
uint32_t* data1 = malloc (sizeof(uint32_t) * TEST_N);
uint32_t* data2 = malloc (sizeof(uint32_t) * TEST_N);
volatile uint32_t* result_algo1 = malloc (sizeof(uint32_t) * TEST_N);
volatile uint32_t* result_algo2 = malloc (sizeof(uint32_t) * TEST_N);
srand (time(NULL));
// Mingw rand() apparently gives numbers up to 32767
// worst case should therefore be 3,276,732,767
// fill up random data in arrays
for(uint32_t i=0; i<TEST_N; i++)
{
data1[i] = rand();
data2[i] = rand();
}
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start, end;
// run algorithm 1
QueryPerformanceCounter(&start);
for(uint32_t i=0; i<TEST_N; i++)
{
result_algo1[i] = uintcat(data1[i], data2[i]);
}
QueryPerformanceCounter(&end);
// print results
printf("Algorithm 1: ");
print_benchmark_results(&start, &end);
printf("\n");
// run algorithm 2
QueryPerformanceCounter(&start);
for(uint32_t i=0; i<TEST_N; i++)
{
result_algo2[i] = myConcat(data1[i], data2[i]);
}
QueryPerformanceCounter(&end);
// print results
printf("Algorithm 2: ");
print_benchmark_results(&start, &end);
printf("\n\n");
// sanity check both algorithms against each other
for(uint32_t i=0; i<TEST_N; i++)
{
if(result_algo1[i] != result_algo2[i])
{
printf("Results mismatch for %lu %lu. Expected: %lu%lu, algo1: %lu, algo2: %lu\n",
data1[i],
data2[i],
data1[i],
data2[i],
result_algo1[i],
result_algo2[i]);
}
}
// clean up
free((void*)data1);
free((void*)data2);
free((void*)result_algo1);
free((void*)result_algo2);
}

Bit operations use the binary representation of the numbers. However what you try to achieve is to concatenate the numbers in decimal notation. Please note that concatenating the decimal representations has little to do with concatenating the binary representations. Though it is theoretically possible to solve the problem using binary operations I am sure it will be far from the most efficient way.

We need to calculate a*10^N + b very fast.
Bit operations isn't a best idea to optimize it (even using tricks like a := (a<<1) + (a<<3) <=> a := a*10 as compiler can make it himself).
The first problem is to calculate 10^N, but there is no need to calculate it, there is just 9 possible values.
The second problem is to calculate N from b (length of 10 representation). If your data have uniform distribution you can minimize operations count in average case.
Check b <= 10^9, b <= 10^8, ..., b <= 10 with ()?: (it's faster than if( ) after optimizations, it has much simplier grammar and functionality), call the result N. Next, make switch(N) with lines "return a*10^N + b" (where 10^N is constant). As I know, switch() with 3-4 "case" is faster than same if( ) construction after optimizations.
unsigned int myConcat(unsigned int& a, unsigned int& b) {
switch( (b >= 10000000) ? 7 :
(b >= 1000000) ? 6 :
(b >= 100000) ? 5 :
(b >= 10000) ? 4 :
(b >= 1000) ? 3 :
(b >= 100) ? 2 :
(b >= 10) ? 1 : 0 ) {
case 1: return a*100+b; break;
case 2: return a*1000+b; break;
case 3: return a*10000+b; break;
case 4: return a*100000+b; break;
case 5: return a*1000000+b; break;
case 6: return a*10000000+b; break;
case 7: return a*100000000+b; break;
default: return a*10+b; break;
// I don't really know what to do here
//case 8: return a*1000*1000*1000+b; break;
//case 9: return a*10*1000*1000*1000+b; break;
}
}
As you can see, there is 2-3 operations in average case + optimisations is very effective here. I've benchmarked it in comparison with Lundin's suggestion, here is the result. 0ms vs 100ms

If you care about decimal digit concatenation, you might want to simply do that as you're printing, and convert both numbers to a sequence of digits sequentially. e.g. How do I print an integer in Assembly Level Programming without printf from the c library? shows an efficient C function, as well as asm. Call it twice into the same buffer.
#Lundin's answer loops increasing powers of 10 to find
the right decimal-shift, i.e. a linear search for the right power of 10. If it's called very frequently so a lookup table can stay hot in cache, a speedup maybe be possible.
If you can use GNU C __builtin_clz (Count Leading Zeros) or some other way of quickly finding the MSB position of the right-hand input (ls, the least-significant part of the resulting concatenation), you can start the search for the right mult from a 32-entry lookup table. (And you only have to check at most one more iteration, so it's not a loop.)
Most of the common modern CPU architectures have a HW instruction that the compiler can use directly or with a bit of processing to implement clz. https://en.wikipedia.org/wiki/Find_first_set#Hardware_support. (And on all but x86, the result is well-defined for an input of 0, but unfortunately GNU C doesn't portably give us access to that.)
If the table stays hot in L1d cache, this can be pretty good. The extra latency of a clz and a table lookup are comparable to a couple iterations of the loop (on a modern x86 like Skylake or Ryzen for example, where bsf or tzcnt is 3 cycle latency, L1d latency is 4 or 5 cycles, imul latency is 3 cycles.)
Of course, on many architectures (including x86), multiplying by 10 is cheaper than by a runtime variable, using shift and add. 2 LEA instructions on x86, or an add+lsl on ARM/AArch64 using a shifted input to do tmp = x + x*4 with the add. So on Intel CPUs, we're only looking at a 2-cycle loop-carried dependency chain, not 3. But AMD CPUs have slower LEA when using a scaled index.
This doesn't sound great for small numbers. But it can reduce branch mispredictions by needing at most one iteration. It even makes a branchless implementation possible. And it means less total work for large lower parts (large powers of 10). But large integers will easily overflow unless you use a wider result type.
Unfortunately, 10 is not a power of 2, so the MSB position alone can't give us the exact power of 10 to multiply by. e.g. all numbers from 64 to 127 all have MSB = 1<<7, but some of them have 2 decimal digits and some have 3. Since we want to avoid division (because it requires a multiplication by a magic constant and shifting the high half), we want to always start with the lower power of 10 and see if that's big enough.
But fortunately, a bitscan does get us within one power of 10 so we no longer need a loop.
I probably wouldn't have written the part with _lzcnt_u32 or ARM __clz if I'd learned of the clz(a|1) trick for avoiding problems with input=0 beforehand. But I did, and played around with the source a bit to try to get nicer asm from gcc and clang. Index the table on clz or BSR depending on target platform makes it a bit of a mess.
#include <stdint.h>
#include <limits.h>
#include <assert.h>
// builtin_clz matches Intel's docs for x86 BSR: garbage result for input=0
// actual x86 HW leaves the destination register unmodified; AMD even documents this.
// but GNU C doesn't let us take advantage with intrinsics.
// unless you use BMI1 _lzcnt_u32
// if available, use an intrinsic that gives us a leading-zero count
// *without* an undefined result for input=0
#ifdef __LZCNT__ // x86 CPU feature
#include <immintrin.h> // Intel's intrinsics
#define HAVE_LZCNT32
#define lzcnt32(a) _lzcnt_u32(a)
#endif
#ifdef __ARM__ // TODO: do older ARMs not have this?
#define HAVE_LZCNT32
#define lzcnt32(a) __clz(a) // builtin, no header needed
#endif
// Some POWER compilers define `__cntlzw`?
// index = msb position, or lzcnt, depending on which the HW can do more efficiently
// defined later; one or the other is unused and optimized out, depending on target platform
// alternative: fill this at run-time startup
// with a loop that does mult*=10 when (x<<1)-1 > mult, or something
//#if INDEX_BY_MSB_POS == 1
__attribute__((unused))
static const uint32_t catpower_msb[] = {
10, // 1 and 0
10, // 2..3
10, // 4..7
10, // 8..15
100, // 16..31 // 2 digits even for the low end of the range
100, // 32..63
100, // 64..127
1000, // 128..255 // 3 digits
1000, // 256..511
1000, // 512..1023
10000, // 1024..2047
10000, // 2048..4095
10000, // 4096..8191
10000, // 8192..16383
100000, // 16384..32767
100000, // 32768..65535 // up to 2^16-1, enough for 16-bit inputs
// ... // fill in the rest yourself
};
//#elif INDEX_BY_MSB_POS == 0
// index on leading zeros
__attribute__((unused))
static const uint32_t catpower_lz32[] = {
// top entries overflow: 10^10 doesn't fit in uint32_t
// intentionally wrong to make it easier to spot bad output.
4000000000, // 2^31 .. 2^32-1 2*10^9 .. 4*10^9
2000000000, // 1,073,741,824 .. 2,147,483,647
// first correct entry
1000000000, // 536,870,912 .. 1,073,741,823
// ... fill in the rest
// for testing, skip until 16 leading zeros
[16] = 100000, // 32768..65535 // up to 2^16-1, enough for 16-bit inputs
100000, // 16384..32767
10000, // 8192..16383
10000, // 4096..8191
10000, // 2048..4095
10000, // 1024..2047
1000, // 512..1023
1000, // 256..511
1000, // 128..255
100, // 64..127
100, // 32..63
100, // 16..31 // low end of the range has 2 digits
10, // 8..15
10, // 4..7
10, // 2..3
10, // 1
// lzcnt32(0) == 32
10, // 0 // treat 0 as having one significant digit.
};
//#else
//#error "INDEX_BY_MSB_POS not set correctly"
//#endif
//#undef HAVE_LZCNT32 // codegen for the other path, for fun
static inline uint32_t msb_power10(uint32_t a)
{
#ifdef HAVE_LZCNT32 // 0-safe lzcnt32 macro available
#define INDEX_BY_MSB_POS 0
// a |= 1 would let us shorten the table, in case 32*4 is a lot nicer than 33*4 bytes
unsigned lzcnt = lzcnt32(a); // 32 for a=0
return catpower_lz32[lzcnt];
#else
// only generic __builtin_clz available
static_assert(sizeof(uint32_t) == sizeof(unsigned) && UINT_MAX == (1ULL<<32)-1, "__builtin_clz isn't 32-bit");
// See also https://foonathan.net/blog/2016/02/11/implementation-challenge-2.html
// for C++ templates for fixed-width wrappers for __builtin_clz
#if defined(__i386__) || defined(__x86_64__)
// x86 where MSB_index = 31-clz = BSR is most efficient
#define INDEX_BY_MSB_POS 1
unsigned msb = 31 - __builtin_clz(a|1); // BSR
return catpower_msb[msb];
//return unlikely(a==0) ? 10 : catpower_msb[msb];
#else
// use clz directly while still avoiding input=0
// I think all non-x86 CPUs with hardware CLZ do define clz(0) = 32 or 64 (the operand width),
// but gcc's builtin is still documented as not valid for input=0
// Most ISAs like PowerPC and ARM that have a bitscan instruction have clz, not MSB-index
// set the LSB to avoid the a==0 special case
unsigned clz = __builtin_clz(a|1);
// table[32] unused, could add yet another #ifdef for that
#define INDEX_BY_MSB_POS 0
//return unlikely(a==0) ? 10 : catpower_lz32[clz];
return catpower_lz32[clz]; // a|1 avoids the special-casing
#endif // optimize for BSR or not
#endif // HAVE_LZCNT32
}
uint32_t uintcat (uint32_t ms, uint32_t ls)
{
// if (ls==0) return ms * 10; // Another way to avoid the special case for clz
uint32_t mult = msb_power10(ls); // catpower[clz(ls)];
uint32_t high = mult * ms;
#if 0
if (mult <= ls)
high *= 10;
return high + ls;
#else
// hopefully compute both and then select
// because some CPUs can shift and add at the same time (x86, ARM)
// so this avoids having an ADD *after* the cmov / csel, if the compiler is smart
uint32_t another10 = high*10 + ls;
uint32_t enough = high + ls;
return (mult<=ls) ? another10 : enough;
#endif
}
From the Godbolt compiler explorer, this compiles efficiently for x86-64 with and without BSR:
# clang7.0 -O3 for x86-64 SysV, -march=skylake -mno-lzcnt
uintcat(unsigned int, unsigned int):
mov eax, esi
or eax, 1
bsr eax, eax # 31-clz(ls|1)
mov ecx, dword ptr [4*rax + catpower_msb]
imul edi, ecx # high = mult * ms
lea eax, [rdi + rdi]
lea eax, [rax + 4*rax] # retval = high * 10
cmp ecx, esi
cmova eax, edi # if(mult>ls) retval = high (drop the *10 result)
add eax, esi # retval += ls
ret
Or with lzcnt, (enabled by -march=haswell or later, or some AMD uarches),
uintcat(unsigned int, unsigned int):
# clang doesn't try to break the false dependency on EAX; gcc uses xor eax,eax
lzcnt eax, esi # C source avoids the |1, saving instructions
mov ecx, dword ptr [4*rax + catpower_lz32]
imul edi, ecx # same as above from here on
lea eax, [rdi + rdi]
lea eax, [rax + 4*rax]
cmp ecx, esi
cmova eax, edi
add eax, esi
ret
Factoring the last add out of both sides of the ternary is a missed optimization, adding 1 cycle of latency after the cmov. We can multiply by 10 and add just as cheaply as multiplying by 10 alone, on Intel CPUs:
... same start # hand-optimized version that clang should use
imul edi, ecx # high = mult * ms
lea eax, [rdi + 4*rdi] # high * 5
lea eax, [rsi + rdi*2] # retval = high * 10 + ls
add edi, esi # tmp = high + ls
cmp ecx, esi
cmova eax, edi # if(mult>ls) retval = high+ls
ret
So the high + ls latency would run in parallel with the high*10 + ls latency, both needed as inputs for cmov.
GCC branches instead of using CMOV for the last condition. GCC also makes a mess of 31-clz(a|1), calculating clz with BSR and XOR with 31. But then subtracting that from 31. And it has some extra mov instructions. Strangely, gcc seems to do better with that BSR code when lzcnt is available, even though it chooses not to use it.
clang has no trouble optimizing away the 31-clz double-inversion and just using BSR directly.
For PowerPC64, clang also makes branchless asm. gcc does something similar, but with a branch like on x86-64.
uintcat:
.Lfunc_gep0:
addis 2, 12, .TOC.-.Lfunc_gep0#ha
addi 2, 2, .TOC.-.Lfunc_gep0#l
ori 6, 4, 1 # OR immediate
addis 5, 2, catpower_lz32#toc#ha
cntlzw 6, 6 # CLZ word
addi 5, 5, catpower_lz32#toc#l # static table address
rldic 6, 6, 2, 30 # rotate left and clear immediate (shift and zero-extend the CLZ result)
lwzx 5, 5, 6 # Load Word Zero eXtend, catpower_lz32[clz]
mullw 3, 5, 3 # mul word
cmplw 5, 4 # compare mult, ls
mulli 6, 3, 10 # mul immediate
isel 3, 3, 6, 1 # conditional select high vs. high*10
add 3, 3, 4 # + ls
clrldi 3, 3, 32 # zero extend, clearing upper 32 bits
blr # return
Compressing the table
Using clz(ls|1) >> 1 or that +1 should work, because 4 < 10. The table always takes at least 3 entries to gain another digit. I haven't investigated this. (And have already spent longer than I meant to on this. :P)
Or right-shift a lot more to just get a starting point for the loop. e.g. mult = clz(ls) >= 18 ? 100000 : 10;, or a 3 or 4-way chain of if.
Or loop on mult *= 100, and after exiting that loop sort out whether you want old_mult * 10 or mult. (i.e. check if you went too far). This cuts the iteration count in half for even numbers of digits.
(Watch out for a possible infinite loop on large ls that would overflow the result. If mult *= 100 wraps to 0, it will always stay <= ls for ls = 1000000000, for example.)

Related

Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute value and index)

I want to see if it's possible to write some generic SIMD code that can compile efficiently. Mostly for SSE, AVX, and NEON. A simplified version of the problem is: Find the maximum absolute value of an array of floating point numbers and return both the value and the index. It is the last part, the index of the maximum, that causes the problem. There doesn't seem to be a very good way to write code that has a branch.
See update at end for finished code using some of the suggested answers.
Here's a sample implementation (more complete version on godbolt):
#define VLEN 8
typedef float vNs __attribute__((vector_size(VLEN*sizeof(float))));
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
#define SWAP128 4,5,6,7, 0,1,2,3
#define SWAP64 2,3, 0,1, 6,7, 4,5
#define SWAP32 1, 0, 3, 2, 5, 4, 7, 6
static bool any(vNb x) {
x = x | __builtin_shufflevector(x,x, SWAP128);
x = x | __builtin_shufflevector(x,x, SWAP64);
x = x | __builtin_shufflevector(x,x, SWAP32);
return x[0];
}
float maxabs(float* __attribute__((aligned(32))) data, unsigned n, unsigned *index) {
vNs max = {0,0,0,0,0,0,0,0};
vNs tmax;
unsigned imax = 0;
for (unsigned i = 0 ; i < n; i += VLEN) {
vNs t = *(vNs*)(data + i);
t = -t < t ? t : -t; // Absolute value
vNb cmp = t > max;
if (any(cmp)) {
tmax = t; imax = i;
// broadcast horizontal max of t into every element of max
vNs tswap128 = __builtin_shufflevector(t,t, SWAP128);
t = t < tswap128 ? tswap128 : t;
vNs tswap64 = __builtin_shufflevector(t,t, SWAP64);
t = t < tswap64 ? tswap64 : t;
vNs tswap32 = __builtin_shufflevector(t,t, SWAP32);
max = t < tswap32 ? tswap32 : t;
}
}
// To simplify example, ignore finding index of true value in tmax==max
*index = imax; // + which(tmax == max);
return max[0];
}
Code on godbolt allows changing VLEN to 8 or 4.
This mostly works very well. For AVX/SSE the absolute value becomes t & 0x7fffffff using a (v)andps, i.e. clear the sign bit. For NEON it's done with vneg + fmaxnm. The block to find and broadcast the horizontal max becomes an efficient sequence of permute and max instructions. gcc is able to use NEON fabs for absolute value.
The 8 element vector on the 4 element SSE/NEON targets works well on clang. It uses a pair of instructions on two sets of registers and for the SWAP128 horizontal op will max or or the two registers without any unnecessary permute. gcc on the other hand really can't handle this and produces mostly non-SIMD code. If we reduce the vector length to 4, gcc works fine for SSE and NEON.
But there's a problem with if (any(cmp)). For clang + SSE/AVX, it works well, vcmpltps + vptest, with an orps to go from 8->4 on SSE.
But gcc and clang on NEON do all the permutes and ORs, then move the result to a gp register to test.
Is there some bit of code, other than architecture specific intrinsics, to get ptest with gcc and vmaxvq with clang/gcc and NEON?
I tried some other methods, like if (x[0] || x[1] || ... x[7]) but they were worse.
Update
I've created an updated example that shows two different implementations, both the original and "indices in a vector" method as suggested by chtz and shown in Aki Suihkonen's answer. One can see the resulting SSE and NEON output.
While some might be skeptical, the compiler does produce very good code from the generic SIMD (not auto-vectorization!) C++ code. On SSE/AVX, I see very little room to improve the code in the loop. The NEON version still troubled by a sub-optimal implementation of "any()".
Unless the data is usually in ascending order, or nearly so, my original version is still fastest on SSE/AVX. I haven't tested on NEON. This is because most loop iterations do not find a new max value and it's best to optimize for that case. The "indices in a vector" method produces a tighter loop and the compiler does a better job too, but the common case is just a bit slower on SSE/AVX. The common case might be equal or faster on NEON.
Some notes on writing generic SIMD code.
The absolute value of a vector of floats can be found with the following. It produces optimal code on SSE/AVX (and with a mask that clears the sign bit) and on NEON (the fabs instruction).
static vNs vabs(vNs x) {
return -x < x ? x : -x;
}
This will do a vertical max efficiently on SSE/AVX/NEON. It doesn't do a compare; it produces the architecture's "max' instruction. On NEON, changing it to use > instead of < causes the compiler to produce very bad scalar code. Something with denormals or exceptions I guess.
template <typename v> // Deduce vector type (float, unsigned, etc.)
static v vmax(v a, v b) {
return a < b ? b : a; // compiles best with "<" as compare op
}
This code will broadcast the horizontal max across a register. It compiles very well on SSE/AVX. On NEON, it would probably be better if the compiler could use a horizontal max instruction and then broadcast the result. I was impressed to see that if one uses 8 element vectors on SSE/NEON, which have only 4 element registers, the compiler is smart enough to use just one register for the broadcasted result, since the top 4 and bottom 4 elements are the same.
template <typename v>
static v hmax(v x) {
if (VLEN >= 8)
x = vmax(x, __builtin_shufflevector(x,x, SWAP128));
x = vmax(x, __builtin_shufflevector(x,x, SWAP64));
return vmax(x, __builtin_shufflevector(x,x, SWAP32));
}
This is the best "any()" I found. It is optimal on SSE/AVX, using a single ptest instruction. On NEON it does the permutes and ORs, instead of a horizontal max instruction, but I haven't found a way to get anything better on NEON.
static bool any(vNb x) {
if (VLEN >= 8)
x |= __builtin_shufflevector(x,x, SWAP128);
x |= __builtin_shufflevector(x,x, SWAP64);
x |= __builtin_shufflevector(x,x, SWAP32);
return x[0];
}
Also interesting, on AVX the code i = i + 1 will be compiled to vpsubd ymmI, ymmI, ymmNegativeOne, i.e. subtract -1. Why? Because a vector of -1s is produced with vpcmpeqd ymm0, ymm0, ymm0 and that's faster than broadcasting a vector of 1s.
Here is the best which() I've come up with. This gives you the index of the 1st true value in a vector of booleans (0 = false, -1 = true). One can do somewhat better on AVX with movemask. I don't know about the best NEON.
// vector of signed ints
typedef int vNi __attribute__((vector_size(VLEN*sizeof(int))));
// vector of bytes, same number of elements, 1/4 the size
typedef unsigned char vNb __attribute__((vector_size(VLEN*sizeof(unsigned char))));
// scalar type the same size as the byte vector
using sNb = std::conditional_t<VLEN == 4, uint32_t, uint64_t>;
static int which(vNi x) {
vNb cidx = __builtin_convertvector(x, vNb);
return __builtin_ctzll((sNb)cidx) / 8u;
}

As commented by chtz, the most generic and typical method is to have another mask to gather indices:
Vec8s indices = { 0,1,2,3,4,5,6,7};
Vec8s max_idx = indices;
Vec8f max_abs = abs(load8(ptr));
for (auto i = 8; i + 8 <= vec_length; i+=8) {
Vec8s data = abs(load8(ptr[i]));
auto mask = is_greater(data, max_abs);
max_idx = bitselect(mask, indices, max_idx);
max_abs = max(max_abs, data);
indices = indices + 8;
}
Another option is to interleave the values and indices:
auto data = load8s(ptr) & 0x7fffffff; // can load data as int32_t
auto idx = vec8s{0,1,2,3,4,5,6,7};
auto lo = zip_lo(idx, data);
auto hi = zip_hi(idx, data);
for (int i = 8; i + 8 <= size; i+=8) {
idx = idx + 8;
auto d1 = load8s(ptr + i) & 0x7fffffff;
auto lo1 = zip_lo(idx, d1);
auto hi1 = zip_hi(idx, d1);
lo = max_u64(lo, lo1);
hi = max_u64(hi, hi1);
}
This method is especially lucrative, if the range of inputs is small enough to shift the input left, while appending a few bits from the index to the LSB bits of the same word.
Even in this case we can repurpose 1 bit in the float allowing us to save one half of the bit/index selection operations.
auto data0 = load8u(ptr) << 1; // take abs by shifting left
auto data1 = (load8u(ptr + 8) << 1) + 1; // encode odd index to data
auto mx = max_u32(data0, data1); // the LSB contains one bit of index
Looks like one can use double as the storage, since even SSE2 supports _mm_max_pd (some attention needs to be given to Inf/Nan handling, which don't encode as Inf/Nan any more when reinterpreted as the high part of 64-bit double).

UPD: the no-aligning issue is fixed now, all the examples on godbolt use aligned reads.
UPD: MISSED THE ABS
Terribly sorry about that, I missed the absolute value from the definition.
I do not have the measurements, but here are all 3 functions vectorised:
max value with abs: https://godbolt.org/z/6Wznrc5qq
find with abs: https://godbolt.org/z/61r9Efxvn
one pass with abs: https://godbolt.org/z/EvdbfnWjb
Asm stashed in a gist
On the method
The way to do max element with simd is to first find the value and then find the index.
Alternatively you have to keep a register of indexes and blend the indexes.
This requires keeping indexes, doing more operations and the problem of the overflow needs to be addressed.
Here are my timings on avx2 by type (char, short and int) for 10'000 bytes of data
The min_element is my implementation of keeping the index.
reduce(min) + find is doing two loops - first get the value, then find where.
For ints (should behave like floats), performance is 25% faster for the two loops solution, at least on my measurements.
For completeness, comparisons against scalar for both methods - this is definitely an operation that should be vectorized.
How to do it
finding the maximum value is auto-vectorised across all platforms if you write it as reduce
if (!arr.size()) return {};
// std::reduce is also ok, just showing for more C ppl
float res = arr[0];
for (int i = 1; i != (int)arr.size(); ++i) {
res = res > arr[i] ? res : arr[i];
}
return res;
https://godbolt.org/z/EsazWf1vT
Now the find portion is trickier, non of the compilers I know autovectorize find
We have eve library that provides you with find algorithm: https://godbolt.org/z/93a98x6Tj
Or I explain how to implement find in this talk if you want to do it yourself.
UPD:
UPD2: changed the blend to max
#Peter Cordes in the comments said that there is maybe a point to doing the one pass solution in case of bigger data.
I have no evidence of this - my measurements point to reduce + find.
However, I hacked together roughly how keeping the index looks (there is an aligning issue at the moment, we should definitely align reads here)
https://godbolt.org/z/djrzobEj4
AVX2 main loop:
.L6:
vmovups ymm6, YMMWORD PTR [rdx]
add rdx, 32
vcmpps ymm3, ymm6, ymm0, 30
vmaxps ymm0, ymm6, ymm0
vpblendvb ymm3, ymm2, ymm1, ymm3
vpaddd ymm1, ymm5, ymm1
vmovdqa ymm2, ymm3
cmp rcx, rdx
jne .L6
ARM-64 main loop:
.L6:
ldr q3, [x0], 16
fcmgt v4.4s, v3.4s, v0.4s
fmax v0.4s, v3.4s, v0.4s
bit v1.16b, v2.16b, v4.16b
add v2.4s, v2.4s, v5.4s
cmp x0, x1
bne .L6
Links to ASM if godbolt becomes stale: https://gist.github.com/DenisYaroshevskiy/56d82c8cf4a4dd5bf91d58b053ea80f2

I don’t believe that’s possible. Compilers aren’t smart enough to do that efficiently.
Compare the other answer (which uses NEON-like pseudocode) with the SSE version below:
// Compare vector absolute value with aa, if greater update both aa and maxIdx
inline void updateMax( __m128 vec, __m128i idx, __m128& aa, __m128& maxIdx )
{
vec = _mm_andnot_ps( _mm_set1_ps( -0.0f ), vec );
const __m128 greater = _mm_cmpgt_ps( vec, aa );
aa = _mm_max_ps( vec, aa );
// If you don't have SSE4, emulate with bitwise ops: and, andnot, or
maxIdx = _mm_blendv_ps( maxIdx, _mm_castsi128_ps( idx ), greater );
}
float maxabs_sse4( const float* rsi, size_t length, size_t& index )
{
// Initialize things
const float* const end = rsi + length;
const float* const endAligned = rsi + ( ( length / 4 ) * 4 );
__m128 aa = _mm_set1_ps( -1 );
__m128 maxIdx = _mm_setzero_ps();
__m128i idx = _mm_setr_epi32( 0, 1, 2, 3 );
// Main vectorized portion
while( rsi < endAligned )
{
__m128 vec = _mm_loadu_ps( rsi );
rsi += 4;
updateMax( vec, idx, aa, maxIdx );
idx = _mm_add_epi32( idx, _mm_set1_epi32( 4 ) );
}
// Handle the remainder, if present
if( rsi < end )
{
__m128 vec;
if( length > 4 )
{
// The source has at least 5 elements
// Offset the source pointer + index back, by a few elements
const int offset = (int)( 4 - ( length % 4 ) );
rsi -= offset;
idx = _mm_sub_epi32( idx, _mm_set1_epi32( offset ) );
vec = _mm_loadu_ps( rsi );
}
else
{
// The source was smaller than 4 elements, copy them into temporary buffer and load vector from there
alignas( 16 ) float buff[ 4 ];
_mm_store_ps( buff, _mm_setzero_ps() );
for( size_t i = 0; i < length; i++ )
buff[ i ] = rsi[ i ];
vec = _mm_load_ps( buff );
}
updateMax( vec, idx, aa, maxIdx );
}
// Reduce to scalar
__m128 tmpMax = _mm_movehl_ps( aa, aa );
__m128 tmpMaxIdx = _mm_movehl_ps( maxIdx, maxIdx );
__m128 greater = _mm_cmpgt_ps( tmpMax, aa );
aa = _mm_max_ps( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
// SSE3 has 100% market penetration in 2022
tmpMax = _mm_movehdup_ps( tmpMax );
tmpMaxIdx = _mm_movehdup_ps( tmpMaxIdx );
greater = _mm_cmpgt_ss( tmpMax, aa );
aa = _mm_max_ss( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
index = (size_t)_mm_cvtsi128_si32( _mm_castps_si128( maxIdx ) );
return _mm_cvtss_f32( aa );
}
As you see, pretty much everything is completely different. Not just the boilerplate about remainder and final reduction, the main loop is very different too.
SSE doesn’t have bitselect; blendvps is not quite that, it selects 32-bit lanes based on high bit of the selector. Unlike NEON, SSE doesn’t have instructions for absolute value, need to be emulated with bitwise andnot.
The final reduction going to be completely different as well. NEON has very limited shuffles, but it has better horizontal operations, like vmaxvq_f32 which finds horizontal maximum over the complete SIMD vector.

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.
Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)
I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:
// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
if (mask & (1ull << i)) {
target[i]++
}
}
The problem is that I need do the above loops on tens of millions of masks and I need to finish in less than a second. Wonder if there are any way to speed it up, like using some sort special assembly instruction that represents the above loop.
Currently I use gcc 4.8.4 on ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code and took about 2 seconds. Would love to make it run under 200ms.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];
int main(int argc, char *argv[]) {
int i, j;
unsigned long x = 123;
unsigned long m = 1;
char *p = malloc(8 * 10000000);
if (!p) {
printf("failed to allocate\n");
exit(0);
}
memset(p, 0xff, 80000000);
printf("p=%p\n", p);
unsigned long *pLong = (unsigned long*)p;
double start = getTS();
for (j = 0; j < 10000000; j++) {
m = 1;
for (i = 0; i < 64; i++) {
if ((pLong[j] & m) == m) {
target[i]++;
}
m = (m << 1);
}
}
printf("took %f secs\n", getTS() - start);
return 0;
}
Thanks in advance!

On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. it runs in 45ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocate\n");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%p\n", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }\n");
printf("took %f secs\n", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.

Related:
an earlier duplicate has some alternate ideas: How to quickly count bits into separate bins in a series of ints on Sandy Bridge?.
Harold's answer on AVX2 column population count algorithm over each bit-column separately.
Matrix transpose and population count has a couple useful answers with AVX2, including benchmarks. It uses 32-bit chunks instead of 64-bit.
Also: https://github.com/mklarqvist/positional-popcount has SSE blend, various AVX2, various AVX512 including Harley-Seal which is great for large arrays, and various other algorithms for positional popcount. Possibly only for uint16_t, but most could be adapted for other word widths. I think the algorithm I propose below is what they call adder_forest.
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize your loop-over-bits for you, even if you write it branchlessly to give them a better chance.
And unfortunately not smart enough to auto-vectorize the fast version that gradually widens and adds.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is good: Unpack bits to something narrower than the final count array gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to accumulate into the 32-bit counter array.
It only takes 4x 16-byte __m128i vectors to hold all 64 uint8_t elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[] once at the very end, after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
An even better strategy is to widen gradually
Starting with 2-bit accumulators, then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data, not "diluting" it too much right away. Higher information / entropy density means that each instruction does more useful work.
Using SWAR techniques for 32x 2-bit add inside scalar or SIMD registers is easy / cheap because we need to avoid the possibility of carry out the top of an element anyway. With proper SIMD, we'd lose those counts, with SWAR we'd corrupt the next element.
uint64_t x = *(input++); // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555; // 0b...01010101;
uint64_t lo = x & even_1bits;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo; // can do up to 3 iterations of this without overflow
accum2_hi += hi; // because a 2-bit integer overflows at 4
Then you repeat up to 4 vectors of 4-bit elements, then 8 vectors of 8-bit elements, then you should widen all the way to 32 and accumulate into the array in memory because you'll run out of registers anyway, and this outer outer loop work is infrequent enough that we don't need to bother with going to 16-bit. (Especially if we manually vectorize).
Biggest downside: this doesn't auto-vectorize, unlike #njuffa's version. But with gcc -O3 -march=sandybridge for AVX1 (then running the code on Skylake), this running scalar 64-bit is actually still slightly faster than 128-bit AVX auto-vectorized asm from #njuffa's code.
But that's timing on Skylake, which has 4 scalar ALU ports (and mov-elimination), while Sandybridge lacks mov-elimination and only has 3 ALU ports, so the scalar code will probably hit back-end execution-port bottlenecks. (But SIMD code may be nearly as fast, because there's plenty of AND / ADD mixed with the shifts, and SnB does have SIMD execution units on all 3 of its ports that have any ALUs on them. Haswell just added port 6, for scalar-only including shifts and branches.)
With good manual vectorization, this should be a factor of almost 2 or 4 faster.
But if you have to choose between this scalar or #njuffa's with AVX2 autovectorization, #njuffa's is faster on Skylake with -march=native
If building on a 32-bit target is possible/required, this suffers a lot (without vectorization because of using uint64_t in 32-bit registers), while vectorized code barely suffers at all (because all the work happens in vector regs of the same width).
// TODO: put the target[] re-ordering somewhere
// TODO: cleanup for N not a multiple of 3*4*21 = 252
// TODO: manual vectorize with __m128i, __m256i, and/or __m512i
void sum_gradual_widen (const uint64_t *restrict input, unsigned int *restrict target, size_t length)
{
const uint64_t *endp = input + length - 3*4*21; // 252 masks per outer iteration
while(input <= endp) {
uint64_t accum8[8] = {0}; // 8-bit accumulators
for (int k=0 ; k<21 ; k++) {
uint64_t accum4[4] = {0}; // 4-bit accumulators can hold counts up to 15. We use 4*3=12
for(int j=0 ; j<4 ; j++){
uint64_t accum2_lo=0, accum2_hi=0;
for(int i=0 ; i<3 ; i++) { // the compiler should fully unroll this
uint64_t x = *input++; // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555;
uint64_t lo = x & even_1bits; // 0b...01010101;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo;
accum2_hi += hi; // can do up to 3 iterations of this without overflow
}
const uint64_t even_2bits = 0x3333333333333333;
accum4[0] += accum2_lo & even_2bits; // 0b...001100110011; // same constant 4 times, because we shift *first*
accum4[1] += (accum2_lo >> 2) & even_2bits;
accum4[2] += accum2_hi & even_2bits;
accum4[3] += (accum2_hi >> 2) & even_2bits;
}
for (int i = 0 ; i<4 ; i++) {
accum8[i*2 + 0] += accum4[i] & 0x0f0f0f0f0f0f0f0f;
accum8[i*2 + 1] += (accum4[i] >> 4) & 0x0f0f0f0f0f0f0f0f;
}
}
// char* can safely alias anything.
unsigned char *narrow = (uint8_t*) accum8;
for (int i=0 ; i<64 ; i++){
target[i] += narrow[i];
}
}
/* target[0] = bit 0
* target[1] = bit 8
* ...
* target[8] = bit 1
* target[9] = bit 9
* ...
*/
// TODO: 8x8 transpose
}
We don't care about order, so accum4[0] has 4-bit accumulators for every 4th bit, for example. The final fixup needed (but not yet implemented) at the very end is an 8x8 transpose of the uint32_t target[64] array, which can be done efficiently using unpck and vshufps with only AVX1. (Transpose an 8x8 float using AVX/AVX2). And also a cleanup loop for the last up to 251 masks.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
Benchmark results on Arch Linux i7-6700k from #njuffa's test harness, with this added. (Godbolt) N = (10000000 / (3*4*21) * 3*4*21) = 9999864 (i.e. 10000000 rounded down to a multiple of the 252 iteration "unroll" factor, so my simplistic implementation is doing the same amount of work, not counting re-ordering target[] which it doesn't do, so it does print mismatch results.
But the printed counts match another position of the reference array.)
I ran the program 4x in a row (to make sure the CPU was warmed up to max turbo) and took one of the runs that looked good (none of the 3 times abnormally high).
ref: the best bit-loop (next section)
fast: #njuffa's code. (auto-vectorized with 128-bit AVX integer instructions).
gradual: my version (not auto-vectorized by gcc or clang, at least not in the inner loop.) gcc and clang fully unroll the inner 12 iterations.
gcc8.2 -O3 -march=sandybridge -fpie -no-pie
ref: 0.331373 secs, fast: 0.011387 secs, gradual: 0.009966 secs
gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.397175 secs, fast: 0.011255 secs, gradual: 0.010018 secs
clang7.0 -O3 -march=sandybridge -fpie -no-pie
ref: 0.352381 secs, fast: 0.011926 secs, gradual: 0.009269 secs (very low counts for port 7 uops, clang used indexed addressing for stores)
clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.293014 secs, fast: 0.011777 secs, gradual: 0.009235 secs
-march=skylake (allowing AVX2 for 256-bit integer vectors) helps both, but #njuffa's most because more of it vectorizes (including its inner-most loop):
gcc8.2 -O3 -march=skylake -fpie -no-pie
ref: 0.328725 secs, fast: 0.007621 secs, gradual: 0.010054 secs (gcc shows no gain for "gradual", only "fast")
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.333922 secs, fast: 0.007620 secs, gradual: 0.009866 secs
clang7.0 -O3 -march=skylake -fpie -no-pie
ref: 0.260616 secs, fast: 0.007521 secs, gradual: 0.008535 secs (IDK why gradual is faster than -march=sandybridge; it's not using BMI1 andn. I guess because it's using 256-bit AVX2 for the k=0..20 outer loop with vpaddq)
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.259159 secs, fast: 0.007496 secs, gradual: 0.008671 secs
Without AVX, just SSE4.2: (-march=nehalem), bizarrely clang's gradual is faster than with AVX / tune=sandybridge. "fast" is only barely slower than with AVX.
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.337178 secs, fast: 0.011983 secs, gradual: 0.010587 secs
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.293555 secs, fast: 0.012549 secs, gradual: 0.008697 secs
-fprofile-generate / -fprofile-use help some for GCC, especially for the "ref" version where it doesn't unroll at all by default.
I highlighted the best, but often they're within measurement noise margin of each other. It's unsurprising the -fno-pie -no-pie was sometimes faster: indexing static arrays with [disp32 + reg] is not an indexed addressing mode, just base + disp32, so it doesn't ever unlaminate on Sandybridge-family CPUs.
But with gcc sometimes -fpie was faster; I didn't check but I assume gcc just shot itself in the foot somehow when 32-bit absolute addressing was possible. Or just innocent-looking differences in code-gen happened to cause alignment or uop-cache problems; I didn't check in detail.
For SIMD, we can simply do 2 or 4x uint64_t in parallel, only accumulating horizontally in the final step where we widen bytes to 32-bit elements. (Perhaps by shuffling in-lane and then using pmaddubsw with a multiplier of _mm256_set1_epi8(1) to add horizontal byte pairs into 16-bit elements.)
TODO: manually-vectorized __m128i and __m256i (and __m512i) versions of this. Should be close to 2x, 4x, or even 8x faster than the "gradual" times above. Probably HW prefetch can still keep up with it, except maybe an AVX512 version with data coming from DRAM, especially if there's contention from other threads. We do a significant amount of work per qword we read.
Obsolete code: improvements to the bit-loop
Your portable scalar version can be improved, too, speeding it up from ~1.92 seconds (with a 34% branch mispredict rate overall, with the fast loops commented out!) to ~0.35sec (clang7.0 -O3 -march=sandybridge) with a properly random input on 3.9GHz Skylake. Or 1.83 sec for the branchy version with != 0 instead of == m, because compilers fail to prove that m always has exactly 1 bit set and/or optimize accordingly.
(vs. 0.01 sec for #njuffa's or my fast version above, so this is pretty useless in an absolute sense, but it's worth mentioning as a general optimization example of when to use branchless code.)
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += (pLong[j] & m != 0);, using bool->int conversion to get a 0 / 1 integer.
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much much worse than we can do with SSE2 or AVX, or even scalar using #chrqlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):
for (int j = 0; j < 10000000; j++) {
#if 1 // extract low bit directly
unsigned long long tmp = pLong[j];
for (int i=0 ; i<64 ; i++) { // while(tmp) could mispredict, but good for sparse data
target[i] += tmp&1;
tmp >>= 1;
}
#else // bool -> int shifting a mask
unsigned long m = 1;
for (i = 0; i < 64; i++) {
target[i]+= (pLong[j] & m) != 0;
m = (m << 1);
}
#endif
Note that unsigned long is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode), and Windows x64. Or in 32-bit ABIs like i386 System V.
Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uops in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2: # =>This Loop Header: Depth=1
# outer loop loads a uint64 from the src
mov rdx, qword ptr [r14 + 8*rbx]
mov rsi, -256
.LBB1_3: # Parent Loop BB1_2 Depth=1
# do {
mov edi, edx
and edi, 1 # isolate the low bit
add dword ptr [rsi + target+256], edi # and += into target
mov edi, edx
shr edi
and edi, 1 # isolate the 2nd bit
add dword ptr [rsi + target+260], edi
shr rdx, 2 # tmp >>= 2;
add rsi, 8
jne .LBB1_3 # } while(offset += 8 != 0);
This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.
If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data, everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().

One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ... 56, second accumulator counts bits in positions 1, 9, 17, ... 57; and so on. After we are finished processing a block of data, we transfers the counts form the byte-wise accumulator into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:
/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 # 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:
p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs
where ref refers to the asker's original solution. The speed-up here is about a factor 74x. Different speed-ups will be observed with other (and especially newer) compilers.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
/*
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
#define N (10000000)
#define BITS (64)
#define BLOCK_SIZE (255)
/* cupdate the count of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
int main (void)
{
double start_ref, stop_ref, start, stop;
uint64_t *pLong;
unsigned int target_ref [BITS] = {0};
unsigned int target [BITS] = {0};
int i, j;
pLong = malloc (sizeof(pLong[0]) * N);
if (!pLong) {
printf("failed to allocate\n");
return EXIT_FAILURE;
}
printf("p=%p\n", pLong);
/* init data */
for (j = 0; j < N; j++) {
pLong[j] = KISS64;
}
/* count bits slowly */
start_ref = second();
for (j = 0; j < N; j++) {
uint64_t m = 1;
for (i = 0; i < BITS; i++) {
if ((pLong[j] & m) == m) {
target_ref[i]++;
}
m = (m << 1);
}
}
stop_ref = second();
/* count bits fast */
start = second();
for (j = 0; j < N / BLOCK_SIZE; j++) {
sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
}
sum_block (pLong, target, j * BLOCK_SIZE, N);
stop = second();
/* check whether result is correct */
for (i = 0; i < BITS; i++) {
if (target[i] != target_ref[i]) {
printf ("error # %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
}
}
/* print benchmark results */
printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
return EXIT_SUCCESS;
}

For starters, the problem of unpacking the bits, because seriously, you do not want to test each bit individually.
So just follow the following strategy for unpacking the bits into bytes of a vector: https://stackoverflow.com/a/24242696/2879325
Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you would have to expect potential overflows, so you need to transfer.
After each block of 255, unpack again to 32bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of byte accumulators).
At 8 instructions per bitmask (4 per each lower and higher 32-bit with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second and core on an recent CPU.
typedef uint64_t T;
const size_t bytes = 8;
const size_t bits = bytes * 8;
const size_t block_size = 128;
static inline __m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x); // we only use the low 32bits of each lane, but this is fine with AVX2
// Each byte gets the source byte containing the corresponding bit
const __m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
const __m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);
// this is the extra step: byte == 0 ? 0 : -1
return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
}
void bitcount_vectorized(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count - (count % block_size); outer += block_size)
{
__m256i temp_accumulator[bits / 32] = { _mm256_setzero_si256() };
for (size_t inner = 0; inner < block_size; ++inner) {
for (size_t j = 0; j < bits / 32; j++)
{
const auto unpacked = expand_bits_to_bytes(static_cast<uint32_t>(data[outer + inner] >> (j * 32)));
temp_accumulator[j] = _mm256_sub_epi8(temp_accumulator[j], unpacked);
}
}
for (size_t j = 0; j < bits; j++)
{
accumulator[j] += ((uint8_t*)(&temp_accumulator))[j];
}
}
for (size_t outer = count - (count % block_size); outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
void bitcount_naive(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
Depending on the chose compiler, the vectorized form achieved roughly a factor 25 speedup over the naive one.
On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.
Surprisingly, this is actually still 50% slower than the solution proposed by #njuffa.

See
Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire (7 Nov 2019)
Faster Population Counts using AVX2 Instructions by Wojciech Muła, Nathan Kurz, Daniel Lemire (23 Nov 2016).
Basically, each full adder compresses 3 inputs to 2 outputs. So one can eliminate an entire 256-bit word for the price of 5 logic instructions. The full adder operation could be repeated until registers become exhausted. Then results in the registers are accumulated (as seen in most of the other answers).
Positional popcnt for 16-bit subwords is implemented here:
https://github.com/mklarqvist/positional-popcount
// Carry-Save Full Adder (3:2 compressor)
b ^= a;
a ^= c;
c ^= b; // xor sum
b |= a;
b ^= c; // carry
Note: the accumulate step for positional-popcnt is more expensive than for normal simd popcnt. Which I believe makes it feasible to add a couple of half-adders to the end of the CSU, it might pay to go all the way up to 256 words before accumulating.

Creating a mask with N least significant bits set

I would like to create a macro or function1 mask(n) which given a number n returns an unsigned integer with its n least significant bits set. Although this seems like it should be a basic primitive with heavily discussed implementations which compile efficiently - this doesn't seem to be the case.
Of course, various implementations may have different sizes for the primitive integral types like unsigned int, so let's assume for the sake of concreteness that we are talking returning a uint64_t specifically although of course an acceptable solutions would work (with different definitions) for any unsigned integral type. In particular, the solution should be efficient when the type returned is equal to or smaller than the platform's native width.
Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1. Many "obvious" solutions don't work for one of these two cases.
The most important criteria is correctness: only correct solutions which don't rely on undefined behavior are interesting.
The second most important criteria is performance: the idiom should ideally compile to approximately the most efficient platform-specific way to do this on common platforms.
A solution that sacrifices simplicity in the name of performance, e.g., that uses different implementations on different platforms, is fine.
1 The most general case is a function, but ideally it would also work as a macro, without re-evaluating any of its arguments more than once.

Try
unsigned long long mask(const unsigned n)
{
assert(n <= 64);
return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
(1ULL << n) - 1ULL;
}
There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.
Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.
The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.
If you need a macro (because you need something that works as a compile-time constant), that might be:
#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.
You could replace 64U with CHAR_BITS*sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
You could similarly generate this from an unsigned right shift, but you would still need to check n == 64 as a special case, since right-shifting by the width of the type is undefined behavior.
ETA:
The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.

Another solution without branching
unsigned long long mask(unsigned n)
{
return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}
n & 0x3F keeps the shift amount to maximum 63 in order to avoid UB. In fact most modern architectures will just grab the lower bits of the shift amount, so no and instruction is needed for this.
The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
The output from Clang looks better than gcc. As it happens gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output
The condition can also be simplified to n >> 6, utilizing the fact that it'll be one if n = 64. And we can subtract that from the result instead of creating a mask like above
return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;
gcc compiles the latter to
mov eax, 1
shlx rax, rax, rdi
shr edi, 6
dec rax
sub rax, rdi
ret
Some more alternatives
return ~((~0ULL << (n & 0x3F)) << (n == 64));
return ((1ULL << (n & 0x3F)) - 1) | (((uint64_t)n >> 6) << 63);
return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
A similar question for 32 bits: Set last `n` bits in unsigned int

Here's one that is portable and conditional-free:
unsigned long long mask(unsigned n)
{
assert (n <= sizeof(unsigned long long) * CHAR_BIT);
return (1ULL << (n/2) << (n-(n/2))) - 1;
}

This is not an answer to the exact question. It only works if 0 isn't a required output, but is more efficient.
2n+1 - 1 computed without overflow. i.e. an integer with the low n bits set, for n = 0 .. all_bits
Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question. Perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.
// defined for n=0 .. sizeof(unsigned long long)*CHAR_BIT
unsigned long long setbits_upto(unsigned n) {
unsigned long long pow2 = 1ULL << n;
return pow2*2 - 1; // one more shift, and subtract 1.
}
Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.
unsigned long long setbits_upto2(unsigned n) {
unsigned long long pow2 = 2ULL << n; // bake in the extra shift count
return pow2 - 1;
}
The table of inputs / outputs for a 32-bit version of this function is:
n -> 1<<n -> *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
...
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
You could slap a cmov after it, or other way of handling an input that has to produce zero.
On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).
xor eax, eax
bts rax, rdi ; rax = 1<<(n&63)
lea rax, [rax + rax - 1] ; one more left shift, and subtract
(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)
In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family
C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).
e.g. gcc and clang both do this (with dec or add -1), on Godbolt
# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
mov ecx, edi
mov eax, 2 ; bake in the extra shift by 1.
sal rax, cl
dec rax
ret
MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:
# ICC19
setbits_upto(unsigned int):
mov eax, 1 #3.21
mov ecx, edi #2.39
shl rax, cl #2.39
lea rax, QWORD PTR [-1+rax+rax] #3.21
ret #3.21
With BMI2 (-march=haswell), we get optimal-for-AMD code from gcc/clang with -march=haswell
mov eax, 2
shlx rax, rax, rdi
add rax, -1
ICC still uses a 3-component LEA, so if you target MSVC or ICC use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. And this avoids the worst of both worlds; slow-LEA and a variable-count shift instead of BTS.
On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.
e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx.
setbits_upto(unsigned int):
mov x1, 2
lsl x0, x1, x0
sub x0, x0, #1
ret
Basically the same on PowerPC, RISC-V, etc.

#include <stdint.h>
uint64_t mask_n_bits(const unsigned n){
uint64_t ret = n < 64;
ret <<= n&63; //the &63 is typically optimized away
ret -= 1;
return ret;
}
Results:
mask_n_bits:
xor eax, eax
cmp edi, 63
setbe al
shlx rax, rax, rdi
dec rax
ret
Returns expected results and if passed a constant value it will be optimized to a constant mask in clang and gcc as well as icc at -O2 (but not -Os) .
Explanation:
The &63 gets optimized away, but ensures the shift is <=64.
For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent pow(2,n)) and subtracting 1 from a power of 2 sets all bits less than that.
By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >=64 because left shifting a 0 will always yield 0. Therefore when we subtract 1, we get all bits set for values of 64 and larger (because of 2s complement representation for -1).
Caveats:
1s complement systems must die - requires special casing if you have one
some compilers may not optimize the &63 away

When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63).
The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.
When N=0, we can make the constant zero before shifting:
uint64_t mask(unsigned N)
{
return -uint64_t(N != 0) >> (64-N & 63);
}
This compiles to five instructions in x64 clang:
neg sets the carry flag to N != 0.
sbb turns the carry flag into 0 or -1.
shr rax,N already has an implicit N & 63, so 64-N & 63 was optimized to -N.
mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret
With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):
neg edi
sbb rax,rax
shrx rax,rax,rdi
ret

The best way to shift a __m128i?

I need to shift a __m128i variable, (say v), by m bits, in such a way that bits move through all of the variable (So, the resulting variable represents v*2^m).
What is the best way to do this?!
Note that _mm_slli_epi64 shifts v0 and v1 seperately:
r0 := v0 << count
r1 := v1 << count
so the last bits of v0 missed, but I want to move those bits to r1.
Edit:
I looking for a code, faster than this (m<64):
r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);

For compile-time constant shift counts, you can get fairly good results. Otherwise not really.
This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64bits up to the high 64 and use a variable-count shift to put them in the right place.
// untested
#include <immintrin.h>
/* some compilers might choke on slli / srli with non-compile-time-constant args
* gcc generates the xmm, imm8 form with constants,
* and generates the xmm, xmm form with otherwise. (With movd to get the count in an xmm)
*/
// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (!count%8) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
__m128i carry = _mm_bslli_si128(x, 8); // old compilers only have the confusingly named _mm_slli_si128 synonym
if (count >= 64)
return _mm_slli_epi64(carry, count-64); // the non-carry part is all zero, so return early
// else
carry = _mm_srli_epi64(carry, 64-count); // After bslli shifted left by 64b
x = _mm_slli_epi64(x, count);
return _mm_or_si128(x, carry);
}
__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100); }
I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):
mm_bitshift_left_3(long long __vector(2)):
vpslldq xmm1, xmm0, 8
vpsrlq xmm1, xmm1, 61
vpsllq xmm0, xmm0, 3
vpor xmm0, xmm0, xmm1
ret
Performance:
This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). Byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uop, 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)
Or for counts >= 64:
mm_bitshift_left_100(long long __vector(2)):
vpslldq xmm0, xmm0, 8
vpsllq xmm0, xmm0, 36
ret
If your shift-count is not a compile-time constant, you have to branch on count > 64 to figure out whether to left or right shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.
It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)
__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
return x << count; // undefined if count >= 128
}

In SSE4.A the instructions insrq and extrq can be used to shift (and rotate) through __mm128i 1-64 bits at a time. Unlike the 8/16/32/64 bit counterparts pextrN/pinsrX, these instructions select or insert m bits (between 1 and 64) at any bit offset from 0 to 127. The caveat is that the sum of lenght and offset must not exceed 128.

Is there a more efficient way of splitting a number into its digits?

I have to split a number into its digits in order to display it on an LCD. Right now I use the following method:
pos = 7;
do
{
LCD_Display(pos, val % 10);
val /= 10;
pos--;
} while (pos >= 0 && val);
The problem with this method is that division and modulo operations are extremely slow on an MSP430 microcontroller. Is there any alternative to this method, something that either does not involve division or that reduces the number of operations?
A note: I can't use any library functions, such as itoa. The libraries are big and the functions themselves are rather resource hungry (both in terms of number of cycles, and RAM usage).

You could do subtractions in a loop with predefined base 10 values.
My C is a bit rusty, but something like this:
int num[] = { 10000000,1000000,100000,10000,1000,100,10,1 };
for (pos = 0; pos < 8; pos++) {
int cnt = 0;
while (val >= num[pos]) {
cnt++;
val -= num[pos];
}
LCD_Display(pos, cnt);
}

Yes, there's another way, originally invented (at least AFAIK) by Terje Mathiesen. Instead of dividing by 10, you (sort of) multiply by the reciprocal. The trick, of course, is that in integers you can't represent the reciprocal directly. To make up for that, you work with scaled integers. If we had floating point, we could extract digits with something like:
input = 123
first digit = integer(10 * (fraction(input * .1))
second digit = integer(100 * (fraction(input * .01))
...and so on for as many digits as needed. To do this with integers, we basically just scale those by 232 (and round each up, since we'll use truncating math). In C, the algorithm looks like this:
#include <stdio.h>
// here are our scaled factors
static const unsigned long long factors[] = {
3435973837, // ceil((0.1 * 2**32)<<3)
2748779070, // ceil((0.01 * 2**32)<<6)
2199023256, // etc.
3518437209,
2814749768,
2251799814,
3602879702,
2882303762,
2305843010
};
static const char shifts[] = {
3, // the shift value used for each factor above
6,
9,
13,
16,
19,
23,
26,
29
};
int main() {
unsigned input = 13754;
for (int i=8; i!=-1; i--) {
unsigned long long inter = input * factors[i];
inter >>= shifts[i];
inter &= (unsigned)-1;
inter *= 10;
inter >>= 32;
printf("%u", inter);
}
return 0;
}
The operations in the loop will map directly to instructions on most 32-bit processors. Your typical multiply instruction will take 2 32-bit inputs, and produce a 64-bit result, which is exactly what we need here. It'll typically be quite a bit faster than a division instruction as well. In a typical case, some of the operations will (or at least with some care, can) disappear in assembly language. For example, where I've done the inter &= (unsigned)-1;, in assembly language you'll normally be able to just use the lower 32-bit register where the result was stored, and just ignore whatever holds the upper 32 bits. Likewise, the inter >>= 32; just means we use the value in the upper 32-bit register, and ignore the lower 32-bit register.
For example, in x86 assembly language, this comes out something like:
mov ebx, 9 ; maximum digits we can deal with.
mov esi, offset output_buffer
next_digit:
mov eax, input
mul factors[ebx*4]
mov cl, shifts[ebx]
shrd eax, edx, cl
mov edx, 10 ; overwrite edx => inter &= (unsigned)-1
mul edx
add dl, '0'
mov [esi], dl ; effectively shift right 32 bits by ignoring 32 LSBs in eax
inc esi
dec ebx
jnz next_digit
mov [esi], bl ; zero terminate the string
For the moment, I've cheated a tiny bit, and written the code assuming an extra item at the beginning of each table (factors and shifts). This isn't strictly necessary, but simplifies the code at the cost of wasting 8 bytes of data. It's pretty easy to do away with that too, but I haven't bothered for the moment.
In any case, doing away with the division makes this a fair amount faster on quite a few low- to mid-range processors that lack dedicated division hardware.

Another way is using double dabble. This is a way to convert binary to BCD with only additions and bit shifts so it's very appropriate for microcontrollers. After splitting to BCDs you can easily print out each number

I would use a temporary string, like:
char buffer[8];
itoa(yourValue, buffer, 10);
int pos;
for(pos=0; pos<8; ++pos)
LCD_Display(pos, buffer[pos]); /* maybe you'll need a cast here */
edit: since you can't use library's itoa, then I think your solution is already the best, providing you compile with max optimization turned on.
You may take a look at this: Most optimized way to calculate modulus in C

This is my attempt at a complete solution. Credit should go to Guffa for providing the general idea. This should work for 32bit integers, signed or otherwise and 0.
#include <stdlib.h>
#include <stdio.h>
#define MAX_WIDTH (10)
static unsigned int uiPosition[] = {
1u,
10u,
100u,
1000u,
10000u,
100000u,
1000000u,
10000000u,
100000000u,
1000000000u,
};
void uitostr(unsigned int uiSource, char* cTarget)
{
int i, c=0;
for( i=0; i!=MAX_WIDTH; ++i )
{
cTarget[i] = 0;
}
if( uiSource == 0 )
{
cTarget[0] = '0';
cTarget[1] = '\0';
return;
}
for( i=MAX_WIDTH -1; i>=0; --i )
{
while( uiSource >= uiPosition[i] )
{
cTarget[c] += 1;
uiSource -= uiPosition[i];
}
if( c != 0 || cTarget[c] != 0 )
{
cTarget[c] += 0x30;
c++;
}
}
cTarget[c] = '\0';
}
void itostr(int iSource, char* cTarget)
{
if( iSource < 0 )
{
cTarget[0] = '-';
uitostr((unsigned int)(iSource * -1), cTarget + 1);
}
else
{
uitostr((unsigned int)iSource, cTarget);
}
}
int main()
{
char szStr[MAX_WIDTH +1] = { 0 };
// signed integer
printf("Signed integer\n");
printf("int: %d\n", 100);
itostr(100, szStr);
printf("str: %s\n", szStr);
printf("int: %d\n", -1);
itostr(-1, szStr);
printf("str: %s\n", szStr);
printf("int: %d\n", 1000000000);
itostr(1000000000, szStr);
printf("str: %s\n", szStr);
printf("int: %d\n", 0);
itostr(0, szStr);
printf("str: %s\n", szStr);
return 0;
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight