Unset the most significant bit in a word (int32) [C]

How can I unset the most significant set bit of a word (e.g. 0x00556844 -> 0x00156844)? There is __builtin_clz in gcc, but it just counts the leading zeroes, which is not quite what I need. Also, what should I replace __builtin_clz with for the MSVC or Intel C compilers?
Currently my code is:
int msb = 1<< ((sizeof(int)*8)-__builtin_clz(input)-1);
int result = input & ~msb;
UPDATE: OK, if you say this code is already fast enough, then let me ask: how do I make it portable? This version is for GCC, but what about MSVC and ICC?

Just round down to the nearest power of 2 and then XOR that with the original value, e.g. using flp2() from Hacker's Delight:
uint32_t flp2(uint32_t x) // round x down to nearest power of 2
{
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
return x - (x >> 1);
}
uint32_t clr_msb(uint32_t x) // clear most significant set bit in x
{
uint32_t msb = flp2(x); // get MS set bit in x
return x ^ msb; // XOR MS set bit to clear it
}
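For example, with the value from the question (a quick check I added; assumes the two functions above):
#include <stdio.h>
int main(void)
{
printf("0x%08X\n", (unsigned)clr_msb(0x00556844)); /* prints 0x00156844 */
return 0;
}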

If you are truly concerned with performance, the best way to clear the msb has recently changed for x86 with the addition of BMI instructions.
In x86 assembly:
clear_msb:
bsrq %rdi, %rax
bzhiq %rax, %rdi, %rax
retq
Now to rewrite in C and let the compiler emit these instructions while gracefully degrading for non-x86 architectures or older x86 processors that don't support BMI instructions.
Compared to the assembly code, the C version is really ugly and verbose. But at least it meets the objective of portability. And if you have the necessary hardware and compiler directives (-mbmi, -mbmi2) to match, you're back to the beautiful assembly code after compilation.
As written, bsr() relies on a GCC/Clang builtin. If targeting other compilers you can replace with equivalent portable C code and/or different compiler-specific builtins.
#include <inttypes.h>
#include <stdio.h>
uint64_t bsr(const uint64_t n)
{
return 63 - (uint64_t)__builtin_clzll(n);
}
uint64_t bzhi(const uint64_t n,
const uint64_t index)
{
const uint64_t leading = (uint64_t)1 << index;
const uint64_t keep_bits = leading - 1;
return n & keep_bits;
}
uint64_t clear_msb(const uint64_t n)
{
return bzhi(n, bsr(n));
}
int main(void)
{
uint64_t i;
for (i = 0; i < (uint64_t)1 << 16; ++i) {
printf("%" PRIu64 "\n", clear_msb(i));
}
return 0;
}
Both assembly and C versions lend themselves naturally to being replaced with 32-bit instructions, as the original question was posed.
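For reference, a 32-bit sketch of the same idea (my adaptation using the 32-bit builtins; assumes n != 0):
uint32_t bsr32(const uint32_t n)
{
return 31 - (uint32_t)__builtin_clz(n);
}
uint32_t bzhi32(const uint32_t n, const uint32_t index)
{
return n & (((uint32_t)1 << index) - 1);
}
uint32_t clear_msb32(const uint32_t n)
{
return bzhi32(n, bsr32(n));
}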

You can do
unsigned resetLeadingBit(uint32_t x) {
return x & ~(0x80000000U >> __builtin_clz(x));
}
For MSVC there is _BitScanReverse, which gives you 31-__builtin_clz().
Actually it's the other way around: BSR is the native x86 instruction, and the gcc intrinsic is implemented as 31-BSR.
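To the portability question in the UPDATE, here is a hedged sketch of how the two compilers' intrinsics could be wrapped behind one helper (clear_top_bit is my name, not a standard function; assumes x != 0):
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
static uint32_t clear_top_bit(uint32_t x)
{
unsigned long idx;
_BitScanReverse(&idx, x); /* idx = index of the highest set bit */
return x & ~(1u << idx);
}
#else /* GCC, Clang, ICC in GCC-compatible mode */
static uint32_t clear_top_bit(uint32_t x)
{
return x & ~(0x80000000u >> __builtin_clz(x));
}
#endif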

Related

How can one make this dynamic bit range code GCC compliant for 64 bit compilers?

I am trying to update Ken Silverman's Paint N Draw 3D C software for Linux, GCC, and 64-bit use, and preserve it on GitHub. I got his permission, but he's too busy to help. I don't want to do a bad job, and I am not a bit-twiddling expert, so I'd like to fix the main parts before I upload it.
In his code pnd3d.c he used a struct called bitmal_t that holds a malloc'd buffer (I think his member name mal means the size of a malloc) and a size indicating a voxel distance as a bit chain spread across the bits of a concatenated set of unsigned 32-bit ints (32-bit as of 2009). So basically, distance is a function of how many bits are on (1) along the extended bit chain. For collisions, he scans up and down the chain for zeros and ones.
Here is his bitmal_t:
//buf: cast to: octv_t* or surf_t*
//bit: 1 bit per sizeof(buf[0]); 0=free, 1=occupied
typedef struct bit { void *buf; unsigned int mal, *bit, ind, num, siz; } bitmal_t;
Here is his range finding code that goes up and down the bit-range looking for a one or a zero. I posted his originals, not my crappy nonworking version.
Here are all the code snippets you would need to reproduce it.
static __forceinline int dntil0 (unsigned int *lptr, int z, int zsiz)
{
// //This line does the same thing (but slow & brute force)
//while ((z < zsiz) && (lptr[z>>5]&(1<<KMOD32(z)))) z++; return(z);
int i;
//WARNING: zsiz must be multiple of 32!
i = (lptr[z>>5]|((1<<KMOD32(z))-1)); z &= ~31;
while (i == 0xffffffff)
{
z += 32; if (z >= zsiz) return(zsiz);
i = lptr[z>>5];
}
return(bsf(~i)+z);
}
static __forceinline int uptil0 (unsigned int *lptr, int z)
{
// //This line does the same thing (but slow & brute force)
//while ((z > 0) && (lptr[(z-1)>>5]&(1<<KMOD32(z-1)))) z--; return(z);
int i;
if (!z) return(0); //Prevent possible crash
i = (lptr[(z-1)>>5]|(-1<<KMOD32(z))); z &= ~31;
while (i == 0xffffffff)
{
z -= 32; if (z < 0) return(0);
i = lptr[z>>5];
}
return(bsr(~i)+z+1);
}
static __forceinline int dntil1 (unsigned int *lptr, int z, int zsiz)
{
// //This line does the same thing (but slow & brute force)
//while ((z < zsiz) && (!(lptr[z>>5]&(1<<KMOD32(z))))) z++; return(z);
int i;
//WARNING: zsiz must be multiple of 32!
i = (lptr[z>>5]&(-1<<KMOD32(z))); z &= ~31;
while (!i)
{
z += 32; if (z >= zsiz) return(zsiz);
i = lptr[z>>5];
}
return(bsf(i)+z);
}
static __forceinline int uptil1 (unsigned int *lptr, int z)
{
// //This line does the same thing (but slow & brute force)
//while ((z > 0) && (!(lptr[(z-1)>>5]&(1<<KMOD32(z-1))))) z--; return(z);
int i;
if (!z) return(0); //Prevent possible crash
i = (lptr[(z-1)>>5]&((1<<KMOD32(z))-1)); z &= ~31;
while (!i)
{
z -= 32; if (z < 0) return(0);
i = lptr[z>>5];
}
return(bsr(i)+z+1);
}
Here are his set range to ones and zeroes functions:
//Set all bits in vbit from (x,y,z0) to (x,y,z1-1) to 0's
#ifndef _WIN64
static __forceinline void setzrange0 (void *vptr, int z0, int z1)
{
int z, ze, *iptr = (int *)vptr;
if (!((z0^z1)&~31)) { iptr[z0>>5] &= ((~(-1<<z0))|(-1<<z1)); return; }
z = (z0>>5); ze = (z1>>5);
iptr[z] &=~(-1<<z0); for(z++;z<ze;z++) iptr[z] = 0;
iptr[z] &= (-1<<z1);
}
//Set all bits in vbit from (x,y,z0) to (x,y,z1-1) to 1's
static __forceinline void setzrange1 (void *vptr, int z0, int z1)
{
int z, ze, *iptr = (int *)vptr;
if (!((z0^z1)&~31)) { iptr[z0>>5] |= ((~(-1<<z1))&(-1<<z0)); return; }
z = (z0>>5); ze = (z1>>5);
iptr[z] |= (-1<<z0); for(z++;z<ze;z++) iptr[z] = -1;
iptr[z] |=~(-1<<z1);
}
#else
static __forceinline void setzrange0 (void *vptr, __int64 z0, __int64 z1)
{
unsigned __int64 z, ze, *iptr = (unsigned __int64 *)vptr;
if (!((z0^z1)&~63)) { iptr[z0>>6] &= ((~(LL(-1)<<z0))|(LL(-1)<<z1)); return; }
z = (z0>>6); ze = (z1>>6);
iptr[z] &=~(LL(-1)<<z0); for(z++;z<ze;z++) iptr[z] = LL(0);
iptr[z] &= (LL(-1)<<z1);
}
//Set all bits in vbit from (x,y,z0) to (x,y,z1-1) to 1's
static __forceinline void setzrange1 (void *vptr, __int64 z0, __int64 z1)
{
unsigned __int64 z, ze, *iptr = (unsigned __int64 *)vptr;
if (!((z0^z1)&~63)) { iptr[z0>>6] |= ((~(LL(-1)<<z1))&(LL(-1)<<z0)); return; }
z = (z0>>6); ze = (z1>>6);
iptr[z] |= (LL(-1)<<z0); for(z++;z<ze;z++) iptr[z] = LL(-1);
iptr[z] |=~(LL(-1)<<z1);
}
#endif
Write some unit tests that pass on the original!
First of all, SSE2 is baseline for x86-64, so you should definitely be using that instead of just 64-bit integers.
GCC (unlike MSVC) assumes no strict-aliasing violations, so the set bit range functions (that cast an incoming pointer to signed int* (!!) or uint64_t* depending on WIN64 or not) might need to be compiled with -fno-strict-aliasing to make pointer-casting well-defined.
You could replace the loop part of the set/clear bit-range functions with memset (which gcc may inline), or a hand-written SSE intrinsics loop if you expect the size to usually be small (like under 200 bytes or so, not worth the overhead of calling libc memset)
I think those dntil0 functions in the first block are just bit-search loops for the first 0 or first 1 bit, forward or backward.
Rewrite them from scratch with SSE2 intrinsics: _mm_cmpeq_epi8 / _mm_movemask_epi8 to find the first byte that isn't all-0 or all-1 bits, then use bsf or bsr on that.
See the glibc source code for SSE2 memchr, or any simpler SSE2-optimized implementation, to find out how to do the byte-search part. Or glibc memmem for an example of comparing for equal, but that's easy: instead of looking for a non-zero _mm_movemask_epi8() (indicating there was a match), look for a result that's != 0xffff (all ones) to indicate that there was a mismatch. Use bsf or bsr on that bitmask to find the byte index into the SIMD vector.
So in total you'll use BSR or BSF twice in each function: one to find the byte index within the SIMD vector, and again to find the bit-index within the target byte.
For the bit-scan function, use GCC __builtin_clz or __builtin_ctz to find the first 1 bit. Bit twiddling: which bit is set?
To search for the first zero instead of the first one, bitwise invert, like __builtin_ctz( ~p[idx] ) where p is an unsigned char* into your search buffer (that you were using _mm_loadu_si128() on), and idx is an offset within that 16 byte window. (That you calculated with __builtin_ctz() on the movemask result that broke out of the vector loop.)
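A minimal sketch of that byte-search idea (my code, not Ken's; byte-aligned start, scanning forward for the first 0 bit, using the GCC/Clang builtins mentioned above):
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>
static int find_first_zero_bit(const unsigned char *buf, size_t start, size_t nbytes)
{
const __m128i allones = _mm_set1_epi8((char)0xFF);
size_t i = start;
for (; i + 16 <= nbytes; i += 16) {
__m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, allones));
if (mask != 0xFFFF) { /* some byte in this 16-byte window is not all-ones */
unsigned idx = (unsigned)__builtin_ctz(~mask & 0xFFFF); /* first such byte */
return (int)((i + idx) * 8) + __builtin_ctz((unsigned)(uint8_t)~buf[i + idx]);
}
}
for (; i < nbytes; i++) /* scalar tail */
if (buf[i] != 0xFF)
return (int)(i * 8) + __builtin_ctz((unsigned)(uint8_t)~buf[i]);
return -1; /* no zero bit found */
}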
How the original worked:
z -= 32 is looping by 32 bits (the size of an int, because this was written assuming it would be compiled for x86 Windows or x86-64 Windows).
lptr[z>>5]; is converting the bit index to an int index. So it's simply looping over the buffer 1 int at a time.
When it finds a 4-byte element that's != 0xFFFFFFFF, it has found an int containing a bit that's not 1; i.e. it contains the bit we're looking for. So it uses bsf or bsr to bit-scan and find the position of that bit within this int.
It adds that to z (the bit-position of the start of this int).
This is exactly the same algorithm I described above, but implemented one integer at a time instead of 16 bytes at a time.
It should really be using uint32_t or unsigned int for bit-manipulations, not signed int, but it obviously works correctly on MSVC.
if (z >= zsiz) return(zsiz); This is the size check to break out of the loop if no bit is found.

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.
Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)
I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:
// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
if (mask & (1ull << i)) {
target[i]++;
}
}
The problem is that I need to do the above loop on tens of millions of masks and I need to finish in less than a second. I wonder if there is any way to speed it up, like using some sort of special assembly instruction that represents the above loop.
Currently I use gcc 4.8.4 on ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code and took about 2 seconds. Would love to make it run under 200ms.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];
int main(int argc, char *argv[]) {
int i, j;
unsigned long x = 123;
unsigned long m = 1;
char *p = malloc(8 * 10000000);
if (!p) {
printf("failed to allocate\n");
exit(0);
}
memset(p, 0xff, 80000000);
printf("p=%p\n", p);
unsigned long *pLong = (unsigned long*)p;
double start = getTS();
for (j = 0; j < 10000000; j++) {
m = 1;
for (i = 0; i < 64; i++) {
if ((pLong[j] & m) == m) {
target[i]++;
}
m = (m << 1);
}
}
printf("took %f secs\n", getTS() - start);
return 0;
}
Thanks in advance!
On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.
Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.
Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.
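For reference, that branchless inner loop looks like this (a sketch of the change only; everything else stays the same):
for (j = 0; j < 10000000; j++) {
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
}
}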
Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.
Here is an improved version using this method. It runs in 45 ms on my system.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>
double getTS() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}
int main(int argc, char *argv[]) {
unsigned int target[64] = { 0 };
unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
int i, j;
if (!pLong) {
printf("failed to allocate\n");
exit(1);
}
memset(pLong, 0xff, sizeof(*pLong) * 10000000);
printf("p=%p\n", (void*)pLong);
double start = getTS();
uint64_t inflate[256];
for (i = 0; i < 256; i++) {
uint64_t x = i;
x = (x | (x << 28));
x = (x | (x << 14));
inflate[i] = (x | (x << 7)) & 0x0101010101010101ULL;
}
for (j = 0; j < 10000000 / 255 * 255; j += 255) {
uint64_t b[8] = { 0 };
for (int k = 0; k < 255; k++) {
uint64_t u = pLong[j + k];
for (int kk = 0; kk < 8; kk++, u >>= 8)
b[kk] += inflate[u & 255];
}
for (i = 0; i < 64; i++)
target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
}
for (; j < 10000000; j++) {
uint64_t m = 1;
for (i = 0; i < 64; i++) {
target[i] += (pLong[j] >> i) & 1;
m <<= 1;
}
}
printf("target = {");
for (i = 0; i < 64; i++)
printf(" %d", target[i]);
printf(" }\n");
printf("took %f secs\n", getTS() - start);
return 0;
}
The technique for inflating a byte to a 64-bit long is investigated and explained in this answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.
Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.
A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.
Related:
an earlier duplicate has some alternate ideas: How to quickly count bits into separate bins in a series of ints on Sandy Bridge?.
Harold's answer on AVX2 column population count algorithm over each bit-column separately.
Matrix transpose and population count has a couple useful answers with AVX2, including benchmarks. It uses 32-bit chunks instead of 64-bit.
Also: https://github.com/mklarqvist/positional-popcount has SSE blend, various AVX2, various AVX512 including Harley-Seal which is great for large arrays, and various other algorithms for positional popcount. Possibly only for uint16_t, but most could be adapted for other word widths. I think the algorithm I propose below is what they call adder_forest.
Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize your loop-over-bits for you, even if you write it branchlessly to give them a better chance.
And unfortunately not smart enough to auto-vectorize the fast version that gradually widens and adds.
See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is good: unpacking bits to something narrower than the final count array gives you more elements per instruction. Bytes are efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to accumulate into the 32-bit counter array.
It only takes 4x 16-byte __m128i vectors to hold all 64 uint8_t elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.
The unpack doesn't have to be in-order: you can always shuffle target[] once at the very end, after accumulating all the results.
The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
An even better strategy is to widen gradually
Starting with 2-bit accumulators, then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data, not "diluting" it too much right away. Higher information / entropy density means that each instruction does more useful work.
Using SWAR techniques for 32x 2-bit add inside scalar or SIMD registers is easy / cheap because we need to avoid the possibility of carry out the top of an element anyway. With proper SIMD, we'd lose those counts, with SWAR we'd corrupt the next element.
uint64_t x = *(input++); // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555; // 0b...01010101;
uint64_t lo = x & even_1bits;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo; // can do up to 3 iterations of this without overflow
accum2_hi += hi; // because a 2-bit integer overflows at 4
Then you repeat up to 4 vectors of 4-bit elements, then 8 vectors of 8-bit elements, then you should widen all the way to 32-bit and accumulate into the array in memory, because you'll run out of registers anyway, and this outer-loop work is infrequent enough that we don't need to bother with going to 16-bit. (Especially if we manually vectorize.)
Biggest downside: this doesn't auto-vectorize, unlike @njuffa's version. But with gcc -O3 -march=sandybridge for AVX1 (then running the code on Skylake), this running scalar 64-bit is actually still slightly faster than 128-bit AVX auto-vectorized asm from @njuffa's code.
But that's timing on Skylake, which has 4 scalar ALU ports (and mov-elimination), while Sandybridge lacks mov-elimination and only has 3 ALU ports, so the scalar code will probably hit back-end execution-port bottlenecks. (But SIMD code may be nearly as fast, because there's plenty of AND / ADD mixed with the shifts, and SnB does have SIMD execution units on all 3 of its ports that have any ALUs on them. Haswell just added port 6, for scalar-only including shifts and branches.)
With good manual vectorization, this should be a factor of almost 2 or 4 faster.
But if you have to choose between this scalar version or @njuffa's with AVX2 autovectorization, @njuffa's is faster on Skylake with -march=native.
If building on a 32-bit target is possible/required, this suffers a lot (without vectorization because of using uint64_t in 32-bit registers), while vectorized code barely suffers at all (because all the work happens in vector regs of the same width).
// TODO: put the target[] re-ordering somewhere
// TODO: cleanup for N not a multiple of 3*4*21 = 252
// TODO: manual vectorize with __m128i, __m256i, and/or __m512i
void sum_gradual_widen (const uint64_t *restrict input, unsigned int *restrict target, size_t length)
{
const uint64_t *endp = input + length - 3*4*21; // 252 masks per outer iteration
while(input <= endp) {
uint64_t accum8[8] = {0}; // 8-bit accumulators
for (int k=0 ; k<21 ; k++) {
uint64_t accum4[4] = {0}; // 4-bit accumulators can hold counts up to 15. We use 4*3=12
for(int j=0 ; j<4 ; j++){
uint64_t accum2_lo=0, accum2_hi=0;
for(int i=0 ; i<3 ; i++) { // the compiler should fully unroll this
uint64_t x = *input++; // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555;
uint64_t lo = x & even_1bits; // 0b...01010101;
uint64_t hi = (x>>1) & even_1bits; // or use ANDN before shifting to avoid a MOV copy
accum2_lo += lo;
accum2_hi += hi; // can do up to 3 iterations of this without overflow
}
const uint64_t even_2bits = 0x3333333333333333;
accum4[0] += accum2_lo & even_2bits; // 0b...001100110011; // same constant 4 times, because we shift *first*
accum4[1] += (accum2_lo >> 2) & even_2bits;
accum4[2] += accum2_hi & even_2bits;
accum4[3] += (accum2_hi >> 2) & even_2bits;
}
for (int i = 0 ; i<4 ; i++) {
accum8[i*2 + 0] += accum4[i] & 0x0f0f0f0f0f0f0f0f;
accum8[i*2 + 1] += (accum4[i] >> 4) & 0x0f0f0f0f0f0f0f0f;
}
}
// char* can safely alias anything.
unsigned char *narrow = (uint8_t*) accum8;
for (int i=0 ; i<64 ; i++){
target[i] += narrow[i];
}
}
/* target[0] = bit 0
* target[1] = bit 8
* ...
* target[8] = bit 1
* target[9] = bit 9
* ...
*/
// TODO: 8x8 transpose
}
We don't care about order, so accum4[0] has 4-bit accumulators for every 4th bit, for example. The final fixup needed (but not yet implemented) at the very end is an 8x8 transpose of the uint32_t target[64] array, which can be done efficiently using unpck and vshufps with only AVX1. (Transpose an 8x8 float using AVX/AVX2). And also a cleanup loop for the last up to 251 masks.
We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)
Benchmark results on Arch Linux i7-6700k from @njuffa's test harness, with this added. (Godbolt) N = (10000000 / (3*4*21) * 3*4*21) = 9999864 (i.e. 10000000 rounded down to a multiple of the 252 iteration "unroll" factor, so my simplistic implementation is doing the same amount of work, not counting re-ordering target[] which it doesn't do, so it does print mismatch results.
But the printed counts match another position of the reference array.)
I ran the program 4x in a row (to make sure the CPU was warmed up to max turbo) and took one of the runs that looked good (none of the 3 times abnormally high).
ref: the best bit-loop (next section)
fast: @njuffa's code. (auto-vectorized with 128-bit AVX integer instructions).
gradual: my version (not auto-vectorized by gcc or clang, at least not in the inner loop.) gcc and clang fully unroll the inner 12 iterations.
gcc8.2 -O3 -march=sandybridge -fpie -no-pie
ref: 0.331373 secs, fast: 0.011387 secs, gradual: 0.009966 secs
gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.397175 secs, fast: 0.011255 secs, gradual: 0.010018 secs
clang7.0 -O3 -march=sandybridge -fpie -no-pie
ref: 0.352381 secs, fast: 0.011926 secs, gradual: 0.009269 secs (very low counts for port 7 uops, clang used indexed addressing for stores)
clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.293014 secs, fast: 0.011777 secs, gradual: 0.009235 secs
-march=skylake (allowing AVX2 for 256-bit integer vectors) helps both, but @njuffa's most because more of it vectorizes (including its inner-most loop):
gcc8.2 -O3 -march=skylake -fpie -no-pie
ref: 0.328725 secs, fast: 0.007621 secs, gradual: 0.010054 secs (gcc shows no gain for "gradual", only "fast")
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.333922 secs, fast: 0.007620 secs, gradual: 0.009866 secs
clang7.0 -O3 -march=skylake -fpie -no-pie
ref: 0.260616 secs, fast: 0.007521 secs, gradual: 0.008535 secs (IDK why gradual is faster than -march=sandybridge; it's not using BMI1 andn. I guess because it's using 256-bit AVX2 for the k=0..20 outer loop with vpaddq)
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.259159 secs, fast: 0.007496 secs, gradual: 0.008671 secs
Without AVX, just SSE4.2: (-march=nehalem), bizarrely clang's gradual is faster than with AVX / tune=sandybridge. "fast" is only barely slower than with AVX.
gcc8.2 -O3 -march=nehalem -fno-pie -no-pie
ref: 0.337178 secs, fast: 0.011983 secs, gradual: 0.010587 secs
clang7.0 -O3 -march=nehalem -fno-pie -no-pie
ref: 0.293555 secs, fast: 0.012549 secs, gradual: 0.008697 secs
-fprofile-generate / -fprofile-use help some for GCC, especially for the "ref" version where it doesn't unroll at all by default.
I highlighted the best, but often they're within measurement noise margin of each other. It's unsurprising the -fno-pie -no-pie was sometimes faster: indexing static arrays with [disp32 + reg] is not an indexed addressing mode, just base + disp32, so it doesn't ever unlaminate on Sandybridge-family CPUs.
But with gcc sometimes -fpie was faster; I didn't check but I assume gcc just shot itself in the foot somehow when 32-bit absolute addressing was possible. Or just innocent-looking differences in code-gen happened to cause alignment or uop-cache problems; I didn't check in detail.
For SIMD, we can simply do 2 or 4x uint64_t in parallel, only accumulating horizontally in the final step where we widen bytes to 32-bit elements. (Perhaps by shuffling in-lane and then using pmaddubsw with a multiplier of _mm256_set1_epi8(1) to add horizontal byte pairs into 16-bit elements.)
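A sketch of that final widening step (my helper, AVX2; it assumes adjacent bytes hold counts that belong in the same output bin):
#include <immintrin.h>
static __m256i sum_bytes_to_dwords(__m256i byte_counts)
{
/* pmaddubsw with a multiplier of 1: add adjacent pairs of unsigned bytes into 16-bit sums */
__m256i w16 = _mm256_maddubs_epi16(byte_counts, _mm256_set1_epi8(1));
/* pmaddwd with a multiplier of 1: add adjacent pairs of 16-bit sums into 32-bit sums */
return _mm256_madd_epi16(w16, _mm256_set1_epi16(1));
}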
TODO: manually-vectorized __m128i and __m256i (and __m512i) versions of this. Should be close to 2x, 4x, or even 8x faster than the "gradual" times above. Probably HW prefetch can still keep up with it, except maybe an AVX512 version with data coming from DRAM, especially if there's contention from other threads. We do a significant amount of work per qword we read.
Obsolete code: improvements to the bit-loop
Your portable scalar version can be improved, too, speeding it up from ~1.92 seconds (with a 34% branch mispredict rate overall, with the fast loops commented out!) to ~0.35sec (clang7.0 -O3 -march=sandybridge) with a properly random input on 3.9GHz Skylake. Or 1.83 sec for the branchy version with != 0 instead of == m, because compilers fail to prove that m always has exactly 1 bit set and/or optimize accordingly.
(vs. 0.01 sec for @njuffa's or my fast version above, so this is pretty useless in an absolute sense, but it's worth mentioning as a general optimization example of when to use branchless code.)
If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.
Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.
Anyway, one way to write this is target[i] += (pLong[j] & m) != 0;, using bool->int conversion to get a 0 / 1 integer.
But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.
An inner loop like this compiles slightly more efficiently (but still much much worse than we can do with SSE2 or AVX, or even scalar using @chqrlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):
for (int j = 0; j < 10000000; j++) {
#if 1 // extract low bit directly
unsigned long long tmp = pLong[j];
for (int i=0 ; i<64 ; i++) { // while(tmp) could mispredict, but good for sparse data
target[i] += tmp&1;
tmp >>= 1;
}
#else // bool -> int shifting a mask
unsigned long m = 1;
for (i = 0; i < 64; i++) {
target[i]+= (pLong[j] & m) != 0;
m = (m << 1);
}
#endif
}
Note that unsigned long is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode), and Windows x64. Or in 32-bit ABIs like i386 System V.
Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uops in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.
# clang7.0 -O3 -march=sandybridge
.LBB1_2: # =>This Loop Header: Depth=1
# outer loop loads a uint64 from the src
mov rdx, qword ptr [r14 + 8*rbx]
mov rsi, -256
.LBB1_3: # Parent Loop BB1_2 Depth=1
# do {
mov edi, edx
and edi, 1 # isolate the low bit
add dword ptr [rsi + target+256], edi # and += into target
mov edi, edx
shr edi
and edi, 1 # isolate the 2nd bit
add dword ptr [rsi + target+260], edi
shr rdx, 2 # tmp >>= 2;
add rsi, 8
jne .LBB1_3 # } while(offset += 8 != 0);
This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.
If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data, everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).
vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.
This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().
One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ... 56, the second accumulator counts bits in positions 1, 9, 17, ... 57; and so on. After we are finished processing a block of data, we transfer the counts from the byte-wise accumulators into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:
/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms, is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 @ 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:
p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs
where ref refers to the asker's original solution. The speed-up here is about a factor 74x. Different speed-ups will be observed with other (and especially newer) compilers.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
/*
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
#define N (10000000)
#define BITS (64)
#define BLOCK_SIZE (255)
/* update the count of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
int jj, k, kk;
uint64_t byte_wise_sum [BITS/8] = {0};
for (jj = lo; jj < hi; jj++) {
uint64_t t = pLong[jj];
for (k = 0; k < BITS/8; k++) {
byte_wise_sum[k] += t & 0x0101010101010101;
t >>= 1;
}
}
/* accumulate byte sums into target */
for (k = 0; k < BITS/8; k++) {
for (kk = 0; kk < BITS; kk += 8) {
target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
}
}
}
int main (void)
{
double start_ref, stop_ref, start, stop;
uint64_t *pLong;
unsigned int target_ref [BITS] = {0};
unsigned int target [BITS] = {0};
int i, j;
pLong = malloc (sizeof(pLong[0]) * N);
if (!pLong) {
printf("failed to allocate\n");
return EXIT_FAILURE;
}
printf("p=%p\n", pLong);
/* init data */
for (j = 0; j < N; j++) {
pLong[j] = KISS64;
}
/* count bits slowly */
start_ref = second();
for (j = 0; j < N; j++) {
uint64_t m = 1;
for (i = 0; i < BITS; i++) {
if ((pLong[j] & m) == m) {
target_ref[i]++;
}
m = (m << 1);
}
}
stop_ref = second();
/* count bits fast */
start = second();
for (j = 0; j < N / BLOCK_SIZE; j++) {
sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
}
sum_block (pLong, target, j * BLOCK_SIZE, N);
stop = second();
/* check whether result is correct */
for (i = 0; i < BITS; i++) {
if (target[i] != target_ref[i]) {
printf ("error # %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
}
}
/* print benchmark results */
printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
return EXIT_SUCCESS;
}
For starters, there is the problem of unpacking the bits, because seriously, you do not want to test each bit individually.
So just follow this strategy for unpacking the bits into the bytes of a vector: https://stackoverflow.com/a/24242696/2879325
Now that you have padded each bit to 8 bits, you can do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you have to expect potential overflows, so you need to transfer the counts out.
After each block of 255, unpack again to 32-bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of the byte accumulators.)
At 8 instructions per bitmask (4 for each of the lower and higher 32 bits with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second per core on a recent CPU.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
typedef uint64_t T;
const size_t bytes = 8;
const size_t bits = bytes * 8;
const size_t block_size = 128;
static inline __m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x); // we only use the low 32bits of each lane, but this is fine with AVX2
// Each byte gets the source byte containing the corresponding bit
const __m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
const __m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);
// this is the extra step: byte == 0 ? 0 : -1
return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
}
void bitcount_vectorized(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count - (count % block_size); outer += block_size)
{
__m256i temp_accumulator[bits / 32] = { _mm256_setzero_si256() };
for (size_t inner = 0; inner < block_size; ++inner) {
for (size_t j = 0; j < bits / 32; j++)
{
const auto unpacked = expand_bits_to_bytes(static_cast<uint32_t>(data[outer + inner] >> (j * 32)));
temp_accumulator[j] = _mm256_sub_epi8(temp_accumulator[j], unpacked);
}
}
for (size_t j = 0; j < bits; j++)
{
accumulator[j] += ((uint8_t*)(&temp_accumulator))[j];
}
}
for (size_t outer = count - (count % block_size); outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
void bitcount_naive(const T *data, uint32_t accumulator[bits], const size_t count)
{
for (size_t outer = 0; outer < count; outer++)
{
for (size_t j = 0; j < bits; j++)
{
if (data[outer] & (T(1) << j))
{
accumulator[j]++;
}
}
}
}
Depending on the chosen compiler, the vectorized form achieved roughly a factor-of-25 speedup over the naive one.
On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.
Surprisingly, this is actually still 50% slower than the solution proposed by @njuffa.
See
Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire (7 Nov 2019)
Faster Population Counts using AVX2 Instructions by Wojciech Muła, Nathan Kurz, Daniel Lemire (23 Nov 2016).
Basically, each full adder compresses 3 inputs to 2 outputs. So one can eliminate an entire 256-bit word for the price of 5 logic instructions. The full adder operation could be repeated until registers become exhausted. Then results in the registers are accumulated (as seen in most of the other answers).
Positional popcnt for 16-bit subwords is implemented here:
https://github.com/mklarqvist/positional-popcount
// Carry-Save Full Adder (3:2 compressor)
b ^= a;
a ^= c;
c ^= b; // xor sum
b |= a;
b ^= c; // carry
Note: the accumulate step for positional popcount is more expensive than for normal SIMD popcount, which I believe makes it feasible to add a couple of half-adders to the end of the CSA chain; it might pay to go all the way up to 256 words before accumulating.

How to check the number of set bits in an 8-bit unsigned char?

So I have to find the set bits (the ones that are 1) of an unsigned char variable in C.
A similar question is How to count the number of set bits in a 32-bit integer? But it uses an algorithm that's not easily adaptable to 8-bit unsigned chars (or it's not apparent how).
The algorithm suggested in the question How to count the number of set bits in a 32-bit integer? is trivially adapted to 8 bit:
int NumberOfSetBits( uint8_t b )
{
b = b - ((b >> 1) & 0x55);
b = (b & 0x33) + ((b >> 2) & 0x33);
return (((b + (b >> 4)) & 0x0F) * 0x01);
}
It is simply a case of shortening the constants to the least significant eight bits, and removing the final 24-bit right-shift. Equally it could be adapted for 16-bit using an 8-bit shift, as sketched below. Note that in the case of 8 bits, the mechanical adaptation of the 32-bit algorithm results in a redundant * 0x01 which could be omitted.
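For instance, the 16-bit adaptation mentioned above might look like this (my sketch, not part of the original answer):
int NumberOfSetBits16( uint16_t b )
{
b = b - ((b >> 1) & 0x5555);
b = (b & 0x3333) + ((b >> 2) & 0x3333);
b = (b + (b >> 4)) & 0x0F0F;
return (b + (b >> 8)) & 0x1F; /* sum the two byte counts */
}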
The fastest approach for an 8-bit variable is using a lookup table.
Build an array of 256 values, one per 8-bit combination. Each value should contain the count of bits in its corresponding index:
int bit_count[] = {
// 00 01 02 03 04 05 06 07 08 09 0a, ... FE FF
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, ..., 7, 8
};
Getting a count of a combination is the same as looking up a value from the bit_count array. The advantage of this approach is that it is very fast.
You can generate the array using a simple program that counts bits one by one in a slow way:
for (int i = 0 ; i != 256 ; i++) {
int count = 0;
for (int p = 0 ; p != 8 ; p++) {
if (i & (1 << p)) {
count++;
}
}
printf("%d, ", count);
}
(demo that generates the table).
If you would like to trade some CPU cycles for memory, you can use a 16-byte lookup table for two 4-bit lookups:
static const char split_lookup[] = {
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};
int bit_count(unsigned char n) {
return split_lookup[n&0xF] + split_lookup[n>>4];
}
Demo.
I think you are looking for the Hamming weight algorithm for 8 bits?
If so, here is the code:
unsigned char in = 22; //This is your input number
unsigned char out = 0;
in = in - ((in>>1) & 0x55);
in = (in & 0x33) + ((in>>2) & 0x33);
out = ((in + (in>>4) & 0x0F) * 0x01) ;
Counting the number of bits different from 0 is also known as the Hamming weight. In this case, you are counting the number of 1's.
Dasblinkenlight provided you with a table driven implementation, and Olaf provided you with a software based solution. I think you have two other potential solutions. The first is to use a compiler extension, the second is to use an ASM specific instruction with inline assembly from C.
For the first alternative, see GCC's __builtin_popcount(). (Thanks to Artless Noise).
For the second alternative, you did not specify the embedded processor, but I'm going to offer this in case its ARM based.
Some ARM processors have the VCNT instruction, which performs the count for you. So you could do it from C with inline assembly:
#include <arm_neon.h> /* for uint8x8_t and the NEON helper intrinsics */
static inline unsigned int hamming_weight(unsigned char value) {
/* Sketch: VCNT.8 counts the set bits in each byte of a NEON vector register.
The operand constraints below are GCC-specific; the vcnt_u8 intrinsic shown
in a later answer is the easier, safer way to reach this instruction. */
uint8x8_t v = vdup_n_u8(value);
__asm__ ("vcnt.8 %P0, %P1" : "=w" (v) : "w" (v));
return vget_lane_u8(v, 0);
}
Also see Fastest way to count number of 1s in a register, ARM assembly.
For completeness, here is Kernighan's bit counting algorithm:
int count_bits(int n) {
int count = 0;
while(n != 0) {
n &= (n-1);
count++;
}
return count;
}
Also see Please explain the logic behind Kernighan's bit counting algorithm.
I made an optimized version. With a 32-bit processor, utilizing multiplication, bit shifting and masking can produce smaller code for the same task, especially when the input domain is small (an 8-bit unsigned integer).
The following two code snippets are equivalent:
unsigned int bit_count_uint8(uint8_t x)
{
uint32_t n;
n = (uint32_t)(x * 0x08040201UL);
n = (uint32_t)(((n >> 3) & 0x11111111UL) * 0x11111111UL);
/* The "& 0x0F" will be optimized out but I add it for clarity. */
return (n >> 28) & 0x0F;
}
/*
unsigned int bit_count_uint8_traditional(uint8_t x)
{
x = x - ((x >> 1) & 0x55);
x = (x & 0x33) + ((x >> 2) & 0x33);
x = ((x + (x >> 4)) & 0x0F);
return x;
}
*/
This produces smallest binary code for IA-32, x86-64 and AArch32 (without NEON instruction set) as far as I can find.
For x86-64, this doesn't use the fewest number of instructions, but the bit shifts and downcasting avoid the use of 64-bit instructions and therefore save a few bytes in the compiled binary.
Interestingly, in IA-32 and x86-64, a variant of the above algorithm using a modulo ((((uint32_t)(x * 0x08040201U) >> 3) & 0x11111111U) % 0x0F) actually generates larger code, due to a requirement to move the remainder register for return value (mov eax,edx) after the div instruction. (I tested all of these in Compiler Explorer)
Explanation
I denote the eight bits of the byte x, from MSB to LSB, as a, b, c, d, e, f, g and h.
abcdefgh
* 00001000 00000100 00000010 00000001 (make 4 copies of x
--------------------------------------- with appropriate
abc defgh0ab cdefgh0a bcdefgh0 abcdefgh bit spacing)
>> 3
---------------------------------------
000defgh 0abcdefg h0abcdef gh0abcde
& 00010001 00010001 00010001 00010001
---------------------------------------
000d000h 000c000g 000b000f 000a000e
* 00010001 00010001 00010001 00010001
---------------------------------------
000d000h 000c000g 000b000f 000a000e
... 000h000c 000g000b 000f000a 000e
... 000c000g 000b000f 000a000e
... 000g000b 000f000a 000e
... 000b000f 000a000e
... 000f000a 000e
... 000a000e
... 000e
^^^^ (Bits 31-28 will contain the sum of the bits
a, b, c, d, e, f, g and h. Extract these
bits and we are done.)
Maybe not the fastest, but straightforward:
int count = 0;
for (int i = 0; i < 8; ++i) {
unsigned char c = 1 << i;
if (yourVar & c) {
//bit n°i is set
//first bit is bit n°0
count++;
}
}
For 8/16 bit MCUs, a loop will very likely be faster than the parallel-addition approach, as these MCUs cannot shift by more than one bit per instruction, so:
size_t popcount(uint8_t val)
{
size_t cnt = 0;
do {
cnt += val & 1U; // or: if ( val & 1 ) cnt++;
} while ( val >>= 1 ) ;
return cnt;
}
For the incrementation of cnt, you might profile. If that is still too slow, an assembler implementation might be worth a try, using the carry flag (if available). While I am against using assembler optimizations in general, such algorithms are one of the few good exceptions (and even then only after the C version proves too slow).
If you can spare the Flash, a lookup table as proposed by @dasblinkenlight is likely the fastest approach.
Just a hint: for some architectures (notably ARM and x86/64), gcc has a builtin, __builtin_popcount(), which you might also want to try if available (although it takes int at least). This might use a single CPU instruction - you cannot get faster and more compact.
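For example (a trivial wrapper, assuming GCC or Clang):
#ifdef __GNUC__
static inline unsigned bit_count_builtin(uint8_t v)
{
return (unsigned)__builtin_popcount(v); /* v is zero-extended to int */
}
#endif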
Allow me to post a second answer. This one is the smallest possible for ARM processors with Advanced SIMD extension (NEON). It's even smaller than __builtin_popcount() (since __builtin_popcount() is optimized for unsigned int input, not uint8_t).
#ifdef __ARM_NEON
/* ARM C Language Extensions (ACLE) recommends us to check __ARM_NEON before
including <arm_neon.h> */
#include <arm_neon.h>
unsigned int bit_count_uint8(uint8_t x)
{
/* Set all lanes at once so that the compiler won't emit instruction to
zero-initialize other lanes. */
uint8x8_t v = vdup_n_u8(x);
/* Count the number of set bits for each lane (8-bit) in the vector. */
v = vcnt_u8(v);
/* Get lane 0 and discard other lanes. */
return vget_lane_u8(v, 0);
}
#endif

Computing high 64 bits of a 64x64 int product in C

I would like my C function to efficiently compute the high 64 bits of the product of two 64 bit signed ints. I know how to do this in x86-64 assembly, with imulq and pulling the result out of %rdx. But I'm at a loss for how to write this in C at all, let alone coax the compiler to do it efficiently.
Does anyone have any suggestions for writing this in C? This is performance sensitive, so "manual methods" (like Russian Peasant, or bignum libraries) are out.
This dorky inline assembly function I wrote works and is roughly the codegen I'm after:
static long mull_hi(long inp1, long inp2) {
long output = -1;
__asm__("movq %[inp1], %%rax;"
"imulq %[inp2];"
"movq %%rdx, %[output];"
: [output] "=r" (output)
: [inp1] "r" (inp1), [inp2] "r" (inp2)
:"%rax", "%rdx");
return output;
}
If you're using a relatively recent GCC on x86_64:
int64_t mulHi(int64_t x, int64_t y) {
return (int64_t)((__int128_t)x*y >> 64);
}
At -O1 and higher, this compiles to what you want:
_mulHi:
0000000000000000 movq %rsi,%rax
0000000000000003 imulq %rdi
0000000000000006 movq %rdx,%rax
0000000000000009 ret
I believe that clang and VC++ also have support for the __int128_t type, so this should also work on those platforms, with the usual caveats about trying it yourself.
The general answer is that x * y can be broken down into (a + b) * (c + d), where a and c are the high order parts.
First, expand to ac + ad + bc + bd
Now, you multiply the terms as 32 bit numbers stored as long long (or better yet, uint64_t), and you just remember that when you multiplied a higher order number, you need to scale by 32 bits. Then you do the adds, remembering to detect carry. Keep track of the sign. Naturally, you need to do the adds in pieces.
For code implementing the above, see my other answer.
With regard to your assembly solution, don't hard-code the mov instructions! Let the compiler do it for you. Here's a modified version of your code:
static long mull_hi(long inp1, long inp2) {
long output;
__asm__("imulq %2"
: "=d" (output)
: "a" (inp1), "r" (inp2));
return output;
}
Helpful reference: Machine Constraints
Since you did a pretty good job solving your own problem with the machine code, I figured you deserved some help with the portable version. I would leave an #ifdef in, so that you just use the assembly when building with GNU on x86.
Anyway, here is an implementation based on my general answer. I'm pretty sure this is correct, but no guarantees, I just banged this out last night. You probably should get rid of the statics positive_result[] and result_negative - those are just artefacts of my unit test.
#include <stdlib.h>
#include <stdio.h>
// stdarg.h doesn't help much here because we need to call llabs()
typedef unsigned long long uint64_t;
typedef signed long long int64_t;
#define B32 0xffffffffUL
static uint64_t positive_result[2]; // used for testing
static int result_negative; // used for testing
static void mixed(uint64_t *result, uint64_t innerTerm)
{
// the high part of innerTerm is actually the easy part
result[1] += innerTerm >> 32;
// the low order a*d might carry out of the low order result
uint64_t was = result[0];
result[0] += (innerTerm & B32) << 32;
if (result[0] < was) // carry!
++result[1];
}
static uint64_t negate(uint64_t *result)
{
uint64_t t = result[0] = ~result[0];
result[1] = ~result[1];
if (++result[0] < t)
++result[1];
return result[1];
}
uint64_t higherMul(int64_t sx, int64_t sy)
{
uint64_t x, y, result[2] = { 0 }, a, b, c, d;
x = (uint64_t)llabs(sx);
y = (uint64_t)llabs(sy);
a = x >> 32;
b = x & B32;
c = y >> 32;
d = y & B32;
// the highest and lowest order terms are easy
result[1] = a * c;
result[0] = b * d;
// now have the mixed terms ad + bc to worry about
mixed(result, a * d);
mixed(result, b * c);
// now deal with the sign
positive_result[0] = result[0];
positive_result[1] = result[1];
result_negative = sx < 0 ^ sy < 0;
return result_negative ? negate(result) : result[1];
}
Wait, you have a perfectly good, optimized assembly solution already working for this, and you want to back it out and try to write it in an environment that doesn't support 128 bit math? I'm not following.
As you're obviously aware, this operation is a single instruction on x86-64. Obviously nothing you do is going to make it work any better. If you really want portable C, you'll need to do something like DigitalRoss's code above and hope that your optimizer figures out what you're doing.
If you need architecture portability but are willing to limit yourself to gcc platforms, there are __int128_t (and __uint128_t) types in the compiler intrinsics that will do what you want.

How to do unsigned saturating addition in C?

What is the best (cleanest, most efficient) way to write saturating addition in C?
The function or macro should add two unsigned inputs (need both 16- and 32-bit versions) and return all-bits-one (0xFFFF or 0xFFFFFFFF) if the sum overflows.
Target is x86 and ARM using gcc (4.1.2) and Visual Studio (for simulation only, so a fallback implementation is OK there).
You probably want portable C code here, which your compiler will turn into proper ARM assembly. ARM has conditional moves, and these can be conditional on overflow. The algorithm then becomes: add and conditionally set the destination to unsigned(-1), if overflow was detected.
uint16_t add16(uint16_t a, uint16_t b)
{
uint16_t c = a + b;
if (c < a) /* Can only happen due to overflow */
c = -1;
return c;
}
Note that this differs from the other algorithms in that it corrects overflow, instead of relying on another calculation to detect overflow.
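The 32-bit version follows the same pattern (spelled out here for completeness; it is just the obvious adaptation):
uint32_t adds32(uint32_t a, uint32_t b)
{
uint32_t c = a + b;
if (c < a) /* unsigned wrap-around means the sum overflowed */
c = -1;
return c;
}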
x86-64 clang 3.7 -O3 output for adds32: significantly better than any other answer:
add edi, esi
mov eax, -1
cmovae eax, edi
ret
ARMv7: gcc 4.8 -O3 -mcpu=cortex-a15 -fverbose-asm output for adds32:
adds r0, r0, r1 # c, a, b
it cs
movcs r0, #-1 # conditional-move
bx lr
16-bit: still doesn't use ARM's unsigned-saturating add instruction (UQADD16)
add r1, r1, r0 # tmp114, a
movw r3, #65535 # tmp116,
uxth r1, r1 # c, tmp114
cmp r0, r1 # a, c
ite ls #
movls r0, r1 #,, c
movhi r0, r3 #,, tmp116
bx lr #
In plain C:
uint16_t sadd16(uint16_t a, uint16_t b) {
return (a > 0xFFFF - b) ? 0xFFFF : a + b;
}
uint32_t sadd32(uint32_t a, uint32_t b) {
return (a > 0xFFFFFFFF - b) ? 0xFFFFFFFF : a + b;
}
which is almost macro-ized and directly conveys the meaning.
In IA32 without conditional jumps:
uint32_t sadd32(uint32_t a, uint32_t b)
{
#if defined IA32
__asm
{
mov eax,a
xor edx,edx
add eax,b
setnc dl
dec edx
or eax,edx
}
#elif defined ARM
// ARM code
#else
// non-IA32/ARM way, copy from above
#endif
}
In ARM you may already have saturated arithmetic built-in. The ARMv5 DSP extensions can saturate registers to any bit-length. Also, on ARM saturation is usually cheap because you can execute most instructions conditionally.
ARMv6 even has saturated addition, subtraction and all the other stuff for 32 bits and packed numbers.
On the x86 you get saturated arithmetic either via MMX or SSE.
All this needs assembler, so it's not what you've asked for.
There are C tricks to do saturated arithmetic as well. This little code does saturated addition on the four bytes of a dword. It's based on the idea of calculating 32 half-adders in parallel, i.e. adding the numbers without letting carries overflow between elements.
This is done first. Then the carries are calculated, added and replaced with a mask if the addition would overflow.
uint32_t SatAddUnsigned8(uint32_t x, uint32_t y)
{
uint32_t signmask = 0x80808080;
uint32_t t0 = (y ^ x) & signmask;
uint32_t t1 = (y & x) & signmask;
x &= ~signmask;
y &= ~signmask;
x += y;
t1 |= t0 & x;
t1 = (t1 << 1) - (t1 >> 7);
return (x ^ t0) | t1;
}
You can get the same for 16 bits (or any kind of bit-field) by changing the signmask constant and the shifts at the bottom like this:
uint32_t SatAddUnsigned16(uint32_t x, uint32_t y)
{
uint32_t signmask = 0x80008000;
uint32_t t0 = (y ^ x) & signmask;
uint32_t t1 = (y & x) & signmask;
x &= ~signmask;
y &= ~signmask;
x += y;
t1 |= t0 & x;
t1 = (t1 << 1) - (t1 >> 15);
return (x ^ t0) | t1;
}
uint32_t SatAddUnsigned32 (uint32_t x, uint32_t y)
{
uint32_t signmask = 0x80000000;
uint32_t t0 = (y ^ x) & signmask;
uint32_t t1 = (y & x) & signmask;
x &= ~signmask;
y &= ~signmask;
x += y;
t1 |= t0 & x;
t1 = (t1 << 1) - (t1 >> 31);
return (x ^ t0) | t1;
}
The code above does the same for 16- and 32-bit values.
If you don't need the feature that the functions add and saturate multiple values in parallel, just mask out the bits you need. On ARM you also want to change the signmask constant because ARM can't load all possible 32-bit constants in a single cycle.
Edit: the parallel versions are most likely slower than the straightforward methods, but they are faster if you have to saturate more than one value at a time.
If you care about performance, you really want to do this sort of stuff in SIMD, where x86 has native saturating arithmetic.
Because of this lack of saturating arithmetic in scalar math, one can get cases in which operations done on 4-variable-wide SIMD is more than 4 times faster than the equivalent C (and correspondingly true with 8-variable-wide SIMD):
sub8x8_dct8_c: 1332 clocks
sub8x8_dct8_mmx: 182 clocks
sub8x8_dct8_sse2: 127 clocks
Zero branch solution:
uint32_t sadd32(uint32_t a, uint32_t b)
{
uint64_t s = (uint64_t)a+b;
return -(s>>32) | (uint32_t)s;
}
A good compiler will optimize this to avoid doing any actual 64-bit arithmetic (s>>32 will merely be the carry flag, and -(s>>32) is the result of sbb %eax,%eax).
In x86 asm (AT&T syntax, a and b in eax and ebx, result in eax):
add %eax,%ebx
sbb %eax,%eax
or %ebx,%eax
8- and 16-bit versions should be obvious. Signed version might require a bit more work.
uint32_t saturate_add32(uint32_t a, uint32_t b)
{
uint32_t sum = a + b;
if ((sum < a) || (sum < b))
return ~((uint32_t)0);
else
return sum;
} /* saturate_add32 */
uint16_t saturate_add16(uint16_t a, uint16_t b)
{
uint16_t sum = a + b;
if ((sum < a) || (sum < b))
return ~((uint16_t)0);
else
return sum;
} /* saturate_add16 */
Edit: Now that you've posted your version, I'm not sure mine is any cleaner/better/more efficient/more studly.
The current implementation we are using is:
#define sadd16(a, b) (uint16_t)( ((uint32_t)(a)+(uint32_t)(b)) > 0xffff ? 0xffff : ((a)+(b)))
#define sadd32(a, b) (uint32_t)( ((uint64_t)(a)+(uint64_t)(b)) > 0xffffffff ? 0xffffffff : ((a)+(b)))
I'm not sure if this is faster than Skizz's solution (always profile), but here's an alternative no-branch assembly solution. Note that this requires the conditional move (CMOV) instruction, which I'm not sure is available on your target.
uint32_t sadd32(uint32_t a, uint32_t b)
{
__asm
{
mov eax, a
add eax, b
mov edx, 0xffffffff
cmovc eax, edx
}
}
I suppose the best way for x86 is to use inline assembler to check the carry flag after the addition. Something like:
add eax, ebx
jnc @@1
or eax, 0FFFFFFFFh
@@1:
.......
It's not very portable, but IMHO the most efficient way.
Just in case someone wants to know an implementation without branching using 2's complement 32bit integers.
Warning! This code uses the undefined operation "shift right by -1" and therefore exploits the property of the Intel SAR instruction to mask the count operand to 5 bits.
int32_t sadd(int32_t a, int32_t b){
int32_t sum = a+b;
int32_t overflow = ((a^sum)&(b^sum))>>31;
return (overflow<<31)^(sum>>overflow);
}
It's the best implementation known to me
The best performance will usually involve inline assembly (as some have already stated).
But for portable C, these functions only involve one comparison and no type-casting (and thus I believe optimal):
unsigned saturate_add_uint(unsigned x, unsigned y)
{
if (y > UINT_MAX - x) return UINT_MAX;
return x + y;
}
unsigned short saturate_add_ushort(unsigned short x, unsigned short y)
{
if (y > USHRT_MAX - x) return USHRT_MAX;
return x + y;
}
As macros, they become:
#define SATURATE_ADD_UINT(x, y) (((y)>UINT_MAX-(x)) ? UINT_MAX : ((x)+(y)))
#define SATURATE_ADD_USHORT(x, y) (((y)>USHRT_MAX-(x)) ? USHRT_MAX : ((x)+(y)))
I leave versions for 'unsigned long' and 'unsigned long long' as an exercise to the reader. ;-)
An alternative to the branch free x86 asm solution is (AT&T syntax, a and b in eax and ebx, result in eax):
add %eax,%ebx
sbb $0,%ebx
int saturating_add(int x, int y)
{
int w = sizeof(int) << 3;
int msb = 1 << (w-1);
int s = x + y;
int sign_x = msb & x;
int sign_y = msb & y;
int sign_s = msb & s;
int nflow = sign_x && sign_y && !sign_s;
int pflow = !sign_x && !sign_y && sign_s;
int nmask = (~!nflow + 1);
int pmask = (~!pflow + 1);
return (nmask & ((pmask & s) | (~pmask & ~msb))) | (~nmask & msb);
}
This implementation doesn't use control flow, compare operators (==, !=) or the ?: operator. It just uses bitwise operators and logical operators.
Using C++ you could write a more flexible variant of Remo.D's solution:
template<typename T>
T sadd(T first, T second)
{
static_assert(std::is_integral<T>::value, "sadd is not defined for non-integral types");
return first > std::numeric_limits<T>::max() - second ? std::numeric_limits<T>::max() : first + second;
}
This can be easily translated to C using the limits defined in limits.h. Please also note that the fixed-width integer types might not be available on your system.
//function-like macro to add signed vals,
//then test for overlow and clamp to max if required
#define SATURATE_ADD(a,b,val) ( {\
if( (a>=0) && (b>=0) )\
{\
val = a + b;\
if (val < 0) {val=0x7fffffff;}\
}\
else if( (a<=0) && (b<=0) )\
{\
val = a + b;\
if (val > 0) {val=-1*0x7fffffff;}\
}\
else\
{\
val = a + b;\
}\
})
I did a quick test and it seems to work, but I have not extensively bashed on it yet! This works with SIGNED 32-bit values.
Saturation arithmetic is not standard for C, but it's often implemented via compiler intrinsics, so the most efficient way will not be the cleanest. You must add #ifdef blocks to select the proper way. MSalters's answer is the fastest for the x86 architecture. For ARM you need to use the __qadd16 function (ARM compiler) or _arm_qadd16 (Microsoft Visual Studio) for the 16-bit version, and __qadd for the 32-bit version. They'll be automatically translated to one ARM instruction (see the sketch after the links below).
Links:
__qadd16
_arm_qadd16
__qadd
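A hedged sketch of what using those intrinsics might look like with a compiler that provides the ARM C Language Extensions (my recollection of the <arm_acle.h> interface; verify the feature macros and exact names against your toolchain, and note that __qadd saturates signed, not unsigned, values):
#if defined(__ARM_FEATURE_DSP)
#include <arm_acle.h>
int32_t qadd32(int32_t a, int32_t b)
{
return __qadd(a, b); /* single QADD instruction on ARMv5E and later */
}
#endif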
I'll add solutions that were not yet mentioned above.
There exists the ADC instruction on Intel x86. It is exposed as the _addcarry_u32() intrinsic function. For ARM there should be a similar intrinsic.
This allows us to implement a very fast uint32_t saturated addition for Intel x86:
#include <stdint.h>
#include <immintrin.h>
uint32_t add_sat_u32(uint32_t a, uint32_t b) {
uint32_t r, carry = _addcarry_u32(0, a, b, &r);
return r | (-carry);
}
Intel x86 MMX saturated addition instructions can be used to implement uint16_t variant:
#include <stdint.h>
#include <immintrin.h>
uint16_t add_sat_u16(uint16_t a, uint16_t b) {
return _mm_cvtsi64_si32(_mm_adds_pu16(
_mm_cvtsi32_si64(a),
_mm_cvtsi32_si64(b)
));
}
I don't include an ARM solution, as it can be covered by the generic solutions from other answers.
