How to implement SWAR unsigned less-than?

I'm trying to use uint64_t as if it was 8 lanes of uint8_ts; my goal is to implement a lane-by-lane less-than. This operation, given x and y, should produce a result with 0xFF in a lane if the value for the corresponding lane in x is less than the value for that lane in y, and 0x00 otherwise. A lane-by-lane less-than-or-equal would also work.
Based on what I've seen, I'm guessing I would need a lanewise difference-or-zero operation (defined as doz(x, y) = if (x < y) then 0 else (x - y)), and then to use that to construct a selection mask. However, all the lane-wise subtraction approaches I've seen are signed, and I'm not sure how I would use them to do this kind of task.
Is there a way I could do this, using difference-or-zero or some other way?

Turns out basing it on DOZ is the wrong way to go after all. All of this is pointless, don't use it.
"However, all the lane-wise subtraction approaches I've seen are signed"
This is surprising, because subtraction is neither signed nor unsigned: there is only one subtraction, and its result can be interpreted either way. At least, that is how it works in the two's complement world.
For reference, SWAR subtraction looks like this: (source: SIMD and SWAR Techniques)
SWAR sub z = x - y
z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)
DOZ could be based on that. A full DOZ is overkill though; it would only make sense if DOZ were available as a primitive. SWAR DOZ would work by computing the difference and then zeroing it out if x < y, which is the condition that we wanted all along. So let's just compute that condition and not the whole DOZ. It comes down to this question: when is there a borrow out of the high bit?
If the high bit of x is zero and the high bit of y is one.
If the high bits of x and y are the same, and the high bit of their difference is one. Equivalently: if the high bits of x and y are the same, and there is a borrow out of the second highest bit.
The first part of SWAR sub, ((x | H) - (y &~H)), computes (among other things) the borrow out of the second highest bit. The high bit of the SWAR difference is the inverse of the borrow out of the second highest bit (that bit from H either gets "eaten" by the borrow, or not).
Putting it together, SWAR unsigned-less-than could work like this:
tmp = ((~x ^ y) & ~((x | H) - (y &~H)) | (~x & y)) & H
less_than_mask = (tmp << 1) - (tmp >> 7)
Parts:
(~x ^ y) = mask of "bits are the same", used for "high bits are the same"
~((x | H) - (y &~H)) = difference of the low parts of elements, used for "borrow out of second highest bit"
(~x & y) = mask of "x is zero and y is one", used for "high bit of x is zero and high bit of y is one"
& H near the end, used to grab only the bits that correspond to the borrow out of the high bit
(tmp << 1) - (tmp >> 7) spreads out the bits grabbed by the previous step into lane-masks. Alternative: (tmp >> 7) * 255. This is the only step where the SWAR logic explicitly depends on the lane size, and it needs to be the same for every lane, even though for SWAR sub you could mix lane sizes.
One operation can be removed at the expression level by applying De Morgan's Rule:
tmp = (~(x ^ y | (x | H) - (y & ~H)) | ~x & y) & H
But ~x needs to be computed anyway, so at the assembly level that may not help, depending on how it gets compiled.
Perhaps some simplification is possible.
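As a minimal sketch, the formula above can be wrapped in a function and compared against a plain per-byte loop. The names swar_ltu8, ref_ltu8, check_swar_ltu8 and LANE_MSB are made up for this illustration (LANE_MSB plays the role of H), and the check only covers the inputs it enumerates.

```c
#include <assert.h>
#include <stdint.h>

/* LANE_MSB is the constant called H in the text: high bit of every 8-bit lane. */
#define LANE_MSB 0x8080808080808080ULL

/* Lane-wise unsigned x < y: 0xFF in each lane where it holds, else 0x00. */
uint64_t swar_ltu8(uint64_t x, uint64_t y)
{
    /* high bit of a lane is clear here exactly when bit 6 borrowed */
    uint64_t low_diff = (x | LANE_MSB) - (y & ~LANE_MSB);
    uint64_t tmp = (((~x ^ y) & ~low_diff) | (~x & y)) & LANE_MSB;
    return (tmp << 1) - (tmp >> 7);   /* spread 0x80 into 0xFF */
}

/* Straightforward per-byte reference, used only for checking. */
uint64_t ref_ltu8(uint64_t x, uint64_t y)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 8) {
        uint8_t xb = (uint8_t)(x >> i), yb = (uint8_t)(y >> i);
        if (xb < yb)
            r |= 0xFFULL << i;
    }
    return r;
}

/* Tries all 256 * 256 byte pairs, with a different neighbour in lane 1. */
void check_swar_ltu8(void)
{
    for (uint64_t a = 0; a < 256; a++)
        for (uint64_t b = 0; b < 256; b++) {
            uint64_t x = a | (b << 8);
            uint64_t y = b | (a << 8);
            assert(swar_ltu8(x, y) == ref_ltu8(x, y));
        }
}
```

Calling check_swar_ltu8() exercises lane 0 with (a, b) and lane 1 with (b, a), so a borrow leaking from one lane into the next would show up as a mismatch.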

Which approach will work the fastest will depend on what kind of instructions are available in the processor architecture of the target platform, such as shift-plus, add, three-input adds, three-input logical instructions. It also depends on whether one desired a throughput- or latency-optimized version, and the superscalarity of the processor architecture.
The following ISO C99 code provides two alternatives. One uses an unsigned byte-wise comparison taken directly from the literature (LTU_VARIANT = 1); the other (LTU_VARIANT = 0) I devised myself on the basis of a halving add (i.e. the sum of two integers divided by two, rounded down). This is based on the fact that for unsigned integers a, b each in [0,255], with ~a denoting the 8-bit complement 255 - a, a <u b ⇔ ~a + b >= 256.
However, this would require nine bits for the sum, so we can use a <u b ⇔ ((~a + b) >> 1) >= 128 instead, where the average can be computed within 8 bits by a well-known bit twiddling technique. The only processor architecture I know of that offers a SIMD halving addition as a hardware instruction (vhadd) is Arm NEON.
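The two equivalences can be checked exhaustively on single bytes before moving to the 64-bit version. This is only an illustrative sketch; havg8, ltu_via_havg and check_ltu_via_havg are hypothetical names, not part of the answer's code.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of (a + b) / 2, computed without needing a ninth bit. */
uint8_t havg8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a & b) + ((a ^ b) >> 1));
}

/* Returns 1 if a < b, read off the top bit of the halving add of ~a and b. */
int ltu_via_havg(uint8_t a, uint8_t b)
{
    return havg8((uint8_t)~a, b) >> 7;
}

void check_ltu_via_havg(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            /* nine-bit form: ~a + b >= 256, with ~a = 255 - a */
            assert(((255 - a) + b >= 256) == (a < b));
            /* eight-bit form: top bit of the halved sum */
            assert(ltu_via_havg((uint8_t)a, (uint8_t)b) == (a < b));
        }
}
```

The 64-bit code applies the same halving add to all eight lanes at once and then only needs the most significant bit of each lane.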
I have included a test framework for a functional test, but benchmarking will be needed to establish which version performs better on a given platform.
Partial overlap of this answer with other answers is probable; there are only so many different ways to skin a kumquat.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#define LTU_VARIANT (1) // 0 or 1
#define UINT64_H8 (0x8080808080808080U) // byte-wise sign bits (MSBs)
uint64_t sign_to_mask8 (uint64_t a)
{
a = a & UINT64_H8; // isolate sign bits
a = a + a - (a >> 7); // extend them to full byte to create mask
return a;
}
uint64_t vhaddu8 (uint64_t a, uint64_t b)
{
/* Peter L. Montgomery's observation (newsgroup comp.arch, 2000/02/11,
https://groups.google.com/d/msg/comp.arch/gXFuGZtZKag/_5yrz2zDbe4J):
(A+B)/2 = (A AND B) + (A XOR B)/2.
*/
return (a & b) + (((a ^ b) >> 1) & ~UINT64_H8);
}
uint64_t ltu8_core (uint64_t a, uint64_t b)
{
/* Sebastiano Vigna, "Broadword implementation of rank/select queries."
In: International Workshop on Experimental and Efficient Algorithms,
pp. 154-168, Springer Berlin Heidelberg, 2008.
*/
return (((a | UINT64_H8) - (b & ~UINT64_H8)) | (a ^ b)) ^ (a | ~b);
}
uint64_t vcmpltu8 (uint64_t a, uint64_t b)
{
#if LTU_VARIANT==1
return sign_to_mask8 (ltu8_core (a, b));
#else // LTU_VARIANT
return sign_to_mask8 (vhaddu8 (~a, b));
#endif // LTU_VARIANT
}
uint64_t ref_func (uint64_t a, uint64_t b)
{
uint8_t a0 = (uint8_t)((a >> 0) & 0xff);
uint8_t a1 = (uint8_t)((a >> 8) & 0xff);
uint8_t a2 = (uint8_t)((a >> 16) & 0xff);
uint8_t a3 = (uint8_t)((a >> 24) & 0xff);
uint8_t a4 = (uint8_t)((a >> 32) & 0xff);
uint8_t a5 = (uint8_t)((a >> 40) & 0xff);
uint8_t a6 = (uint8_t)((a >> 48) & 0xff);
uint8_t a7 = (uint8_t)((a >> 56) & 0xff);
uint8_t b0 = (uint8_t)((b >> 0) & 0xff);
uint8_t b1 = (uint8_t)((b >> 8) & 0xff);
uint8_t b2 = (uint8_t)((b >> 16) & 0xff);
uint8_t b3 = (uint8_t)((b >> 24) & 0xff);
uint8_t b4 = (uint8_t)((b >> 32) & 0xff);
uint8_t b5 = (uint8_t)((b >> 40) & 0xff);
uint8_t b6 = (uint8_t)((b >> 48) & 0xff);
uint8_t b7 = (uint8_t)((b >> 56) & 0xff);
uint8_t r0 = (a0 < b0) ? 0xff : 0x00;
uint8_t r1 = (a1 < b1) ? 0xff : 0x00;
uint8_t r2 = (a2 < b2) ? 0xff : 0x00;
uint8_t r3 = (a3 < b3) ? 0xff : 0x00;
uint8_t r4 = (a4 < b4) ? 0xff : 0x00;
uint8_t r5 = (a5 < b5) ? 0xff : 0x00;
uint8_t r6 = (a6 < b6) ? 0xff : 0x00;
uint8_t r7 = (a7 < b7) ? 0xff : 0x00;
return ( ((uint64_t)r0 << 0) +
((uint64_t)r1 << 8) +
((uint64_t)r2 << 16) +
((uint64_t)r3 << 24) +
((uint64_t)r4 << 32) +
((uint64_t)r5 << 40) +
((uint64_t)r6 << 48) +
((uint64_t)r7 << 56) );
}
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
uint64_t a, b, res, ref, n = 0;
printf ("Testing vcmpltu8: byte-wise unsigned comparison with mask result\n");
printf ("using LTU variant %d\n", LTU_VARIANT);
do {
a = KISS64;
b = KISS64;
res = vcmpltu8 (a, b);
ref = ref_func (a, b);
if (res != ref) {
printf ("\nerr # a=%016" PRIx64 " b=%016" PRIx64 " : res=%016" PRIx64 " ref=%016" PRIx64 "\n",
a, b, res, ref);
return EXIT_FAILURE;
}
n++;
if (!(n & 0xffffff)) printf ("\r%016" PRIx64, n);
} while (a);
printf ("\ntest passed\n");
return EXIT_SUCCESS;
}

Here's an architecture-independent approach. I'm sure it could use refinement, but it seems to be working fine. With x86 gcc/clang, it compiles to 20/19 instructions.
The idea is to first solve the problem when both bytes are either less than 128 or not, setting bit 7 in each byte with that result. Then patch up the other cases. Finally smear the bit 7's downward.
#include <stdio.h>
#include <stdint.h>
uint64_t bwlt(uint64_t a, uint64_t b) {
uint64_t lo7 = ~0ull / 255 * 127, // low 7 bits set in each byte
alo7 = a & lo7, // mask low 7 bits in a
blo7 = b & lo7, // mask low 7 bits in b
r = (lo7 - alo7 + blo7) & ~lo7, // set 8th bits with a < b
diff = (a ^ b) & ~lo7; // 8th bits that differ
r &= ~(a & diff); // unset if a[i]_7=1,b[i]_7=0
r |= b & diff; // set if a[i]_7=0,b[i]_7=1
return (r << 1) - (r >> 7);
}
int main(void) {
uint64_t a = 0x11E1634052A6B7CB;
uint64_t b = 0x1EAEF1E85F26734E;
printf("r=%016llx\n", (unsigned long long)bwlt(a, b));
return 0;
}
One test case:
$ gcc foo.c -o foo
$ ./foo
r=ff00ffffff000000

I enjoyed figuring out how to create the SWAR x LT (Less Than) y function with 64bit unsigned int and using only logical operators and arithmetic + and -.
I looked at some information on the web (https://www.chessprogramming.org/SIMD_and_SWAR_Techniques) and from there I got the idea that the function can be done starting from the subtraction (x - y).
Looking at the meaning of the highest bit of: x, y and (x - y) when unsigned int are used, I created the following truth table where:
R (result) is 1 when the LT condition occurs.
D is the highest bit of the subtraction (x-y),
X is the highest bit of the X value to be tested,
Y is the highest bit of the Y value to be tested.
D X Y | R
0 0 0 | 0
0 0 1 | 1
0 1 0 | 0
0 1 1 | 0
1 0 0 | 1
1 0 1 | 1
1 1 0 | 0
1 1 1 | 1
Applying a Karnaugh map (https://getcalc.com/karnaugh-map/3variable-kmap-solver.htm) to the table above, we obtain the following formula:
(~ X & Y) | (D & ~ X) | (D & Y)
from which the macro SWARLTU(x, y) arose (see file swar.h below).
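The truth table and the derived formula can be sanity-checked on plain bytes. The sketch below is an assumption-labelled illustration (ltu_from_high_bits and check_truth_table are invented names); it evaluates the formula on the three high bits for every pair of byte values.

```c
#include <assert.h>
#include <stdint.h>

/* (~X & Y) | (D & ~X) | (D & Y), applied to single bits that are 0 or 1. */
int ltu_from_high_bits(int d, int x, int y)
{
    return ((!x) & y) | (d & (!x)) | (d & y);
}

void check_truth_table(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            int d = (uint8_t)(a - b) >> 7;   /* high bit of the 8-bit difference */
            int x = a >> 7;                  /* high bit of x */
            int y = b >> 7;                  /* high bit of y */
            assert(ltu_from_high_bits(d, x, y) == (a < b));
        }
}
```

If check_truth_table() returns without tripping an assertion, the formula agrees with a < b for all 65536 byte pairs.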
Since I was not satisfied, I observed how the compiler generated the assembler code of the macro SWARLTU and then following that code I wrote the macro SWARLTU2(x, y) (see file swar.h below). This last macro should be logically optimized.
The limit of this code is that the value for the LT result is 0x80 and not 0xFF as requested in the question.
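If the 0xFF form is needed, a mask holding 0x80 per lane can be widened afterwards. The following sketch shows two ways of doing that, mirroring the shift-and-subtract and multiply variants mentioned in the other answers; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Widen 0x80 / 0x00 per byte into 0xFF / 0x00 per byte by shift and subtract. */
uint64_t widen_mask_sub(uint64_t m)
{
    m &= 0x8080808080808080ULL;      /* keep only the per-lane high bits */
    return (m << 1) - (m >> 7);
}

/* Same result with a multiply: each lane's 0 or 1 times 255. */
uint64_t widen_mask_mul(uint64_t m)
{
    return ((m >> 7) & 0x0101010101010101ULL) * 0xFF;
}

void check_widen(void)
{
    /* every combination of the eight lane bits */
    for (unsigned bits = 0; bits < 256; bits++) {
        uint64_t m = 0, want = 0;
        for (int i = 0; i < 8; i++)
            if (bits & (1u << i)) {
                m |= 0x80ULL << (8 * i);
                want |= 0xFFULL << (8 * i);
            }
        assert(widen_mask_sub(m) == want);
        assert(widen_mask_mul(m) == want);
    }
}
```

Which of the two is cheaper depends on the target; that would need to be measured rather than assumed.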
The program can be launched in three different ways:
Without parameters, in this case it will perform 10 tests on random numbers.
With only one parameter, the parameter will indicate the number of random tests to be performed.
With two parameters, two numbers in the form 0xnnnnn, in this case only the control of the entered values will be shown.
Here is the code:
The file swar.h (this file also contains other SWAR macros, e.g. SHL and SHR)
#ifndef SWAR_H
#define SWAR_H
/*
https://www.chessprogramming.org/SIMD_and_SWAR_Techniques
SWAR add z = x + y
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
SWAR sub z = x - y
z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)
SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
z = (x & y) + (((x ^ y) & ~L) >> 1)
*/
// 0 1 2 3 4 5 6 7
#define SWARH 0x8080808080808080LL
#define SWARL 0x0101010101010101LL
#define SWARADD(x,y) \
((( (x) &~SWARH) + ( (y) &~SWARH)) ^ (( (x) ^ (y) ) & SWARH))
#define SWARSUB(x,y) \
((( (x) | SWARH) - ( (y) &~SWARH)) ^ (( (x) ^~(y) ) & SWARH))
#define SWARAVE(x,y) \
(( (x) & (y) ) + ((( (x) ^ (y)) & ~SWARL) >> 1))
#define SWARLTI(x,y) \
( SWARSUB(x,y) & SWARH )
#define SWARSHL(x) \
(((x)&(~SWARH))<<1)
#define SWARSHR(x) \
(((x)&(~SWARL))>>1)
/*** Computing unsigned less than
Truth table considering the HIGH bit setting of
Difference, X Value, Y Value
D X Y | R
0 0 0 | 0
0 0 1 | 1
0 1 0 | 0
0 1 1 | 0
1 0 0 | 1
1 0 1 | 1
1 1 0 | 0
1 1 1 | 1
***/
#define SWARDH(x,y) (SWARSUB(x,y) & SWARH)
#define SWARXH(x) ((x) & SWARH)
#define SWARYH(y) ((y) & SWARH)
#define SWARLTU(x,y) \
((~SWARXH(x) & SWARYH(y)) | (SWARDH(x,y) & ~SWARXH(x)) | (SWARDH(x,y) & SWARYH(y)))
// Elaborated from the generated ASM of the previous.
// Y is masked with ~SWARH before the subtraction so that no borrow can cross into the next lane.
#define SWARLTU2(X,Y) \
((((~((X) & SWARH)) & ((((~((X) ^ (Y))) & SWARH) ^ (((X) | SWARH) - ((Y) & ~SWARH))) | (Y))) | \
((((~((X) ^ (Y))) & SWARH) ^ (((X) | SWARH) - ((Y) & ~SWARH))) & (Y))) & SWARH)
#endif // SWAR_H
The file main.c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <inttypes.h>
#include <time.h>
#include "swar.h"
char * verifyltu(char * rs,uint64_t x, uint64_t y, uint64_t v);
void printvalues(uint64_t x,uint64_t y,uint64_t r,uint64_t r1);
uint64_t rand64();
int main(int argc, char *argv[])
{
int rnd=1;
size_t i,n=10;
uint64_t x=0,y=0,r,r1;
srand(time(NULL));
if (argc>1) {
if (argc==2) {
n=strtoul(argv[1],NULL,0);
} else {
x=strtoull(argv[1],NULL,0);
y=strtoull(argv[2],NULL,0);
rnd=0;
}
}
if (rnd) {
for(i=0;i<n;i++) {
x=rand64();
y=rand64();
r=SWARLTU(x,y);
r1=SWARLTU2(x,y);
printvalues(x,y,r,r1);
}
} else {
r=SWARLTU(x,y);
r1=SWARLTU2(x,y);
printvalues(x,y,r,r1);
}
return 0;
}
char * verifyltu(char * rs,uint64_t x, uint64_t y, uint64_t v)
{
size_t i;
uint8_t *xs, *ys, *vs;
xs=(uint8_t *)&x; ys=(uint8_t *)&y;
vs=(uint8_t *)&v;
for(i=0;i<sizeof(uint64_t);i++) {
if ( ( xs[i]<ys[i] && vs[i]&0x80) ||
( !(xs[i]<ys[i]) && !(vs[i]&0x80) ) )
{
rs[i*2]='*';rs[i*2+1]=' ';
} else {
rs[i*2]='-';rs[i*2+1]=' ';
}
}
rs[i*2]=0;
return rs;
}
void printvalues(uint64_t x,uint64_t y,uint64_t r,uint64_t r1)
{
char rs[17],rs1[17];
printf(
"X %016" PRIX64 " <\n"
"Y %016" PRIX64 "\n"
" ----------------\n"
"LTU %016" PRIX64 "\n"
"*=Ok %s\n"
"LTU2 %016" PRIX64 "\n"
"*=Ok %s\n\n",
x,y,
r,verifyltu(rs,x,y,r),
r1,verifyltu(rs1,x,y,r1)
);
}
uint64_t rand64()
{
uint64_t x;
x=rand(); x=(x<<32)+rand();
return x;
}

I came up with
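uint64_t magic(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8080808080808080ull;
    uint64_t c = (a | H) - (b & ~H);        // per lane: 1aaaaaaa - 0bbbbbbb
    uint64_t d = a ^ b;                     // high bit set where the "signs" differ
    uint64_t e = ((a & d) | (c & ~d)) & H;  // select a or c per lane
    return e ^ H;                           // invert the selected high bits
}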
The logic follows pretty much the same path as harold's answer; the difference is in how the top bits are interpreted:
c = 1aaaaaaa -> the H bit of the difference is 0 iff the low 7 bits of a are less than those of b
0bbbbbbb
a = 80, b = 00 different sign -> select b, i.e. ~a
a = 00, b = 80 different sign -> select b, i.e. ~a
a = 80, b = 80 same sign -> select ~c
a = 00, b = 00 same sign -> select ~c
If one could work with inverted mask (i.e. b >= a), then the last ^H can be omitted as well.
The instruction counts obtained with clang on godbolt for arm64 and x64 are listed below, without and (with) sign_to_mask.
instructions arm64 x64
-------------+----------+---------
vhaddu8 | 5 (8) | 8 (13)
ltu8_core | 7 (10)| 11 (15)
magic | 9 (10)| 12 (15)
harold | 9 (11)| 14 (17)
bwlt | 10 (12)| 15 (18)
SirJoBlack | 11 (13)| 16 (19)

Related

Wrong results multiplying two 32 bit numbers in C

I am trying to multiply two matrices in C and I can't understand why I get these results.
I want to compute Btranspose * B:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <math.h>
#define LOW_WORD(x) (((x) << 16) >> 16)
#define HIGH_WORD(x) ((x) >> 16)
#define ABS(x) (((x) >= 0) ? (x) : -(x))
#define SIGN(x) (((x) >= 0) ? 1 : -1)
#define UNSIGNED_MULT(a, b) \
(((LOW_WORD(a) * LOW_WORD(b)) << 0) + \
(((int64_t)((LOW_WORD((a)) * HIGH_WORD((b))) + (HIGH_WORD((a)) * LOW_WORD((b))))) << 16) + \
((int64_t)(HIGH_WORD((a)) * HIGH_WORD((b))) << 32))
#define MULT(a, b) (UNSIGNED_MULT(ABS((a)), ABS((b))) * SIGN((a)) * SIGN((b)))
int main()
{
int c,d,k;
int64_t multmatrix[3][3];
int64_t sum64 = 0;
int32_t Btranspose[3][3] = {{15643, 24466, 58751},
{54056, 26823, -25563},
{-33591, 54561, -13777}};
int32_t B[3][3] = {{15643, 54056, -33591},
{24466, 26823, 54561},
{58751, -25563, -13777}};
for ( c = 0 ; c < 3 ; c++ ){
for ( d = 0 ; d < 3 ; d++ ){
for ( k = 0 ; k < 3 ; k++ ){
sum64 = sum64 + MULT(Btranspose[c][k], B[k][d]);
printf("\n the MULT for k = %d is: %ld \n", k, MULT(Btranspose[c][k], B[k][d]));
printf("\n the sum for k = %d is: %ld \n", k, sum64);
}
multmatrix[c][d] = sum64;
sum64 = 0;
}
}
printf("\n\n multmatrix \n");
for( c = 0 ; c < 3; c++ ){
printf("\n");
for( d = 0 ; d < 3 ; d++ ){
printf(" %ld ", multmatrix[c][d]);
}
}
return 0;
}
My output is below, but it is wrong; I notice that the mistake occurs when multiplying the third element (58751 * 58751) for k=2.
I don't think it is overflowing, because 58751^2 fits in 32 bits.
the MULT for k = 0 is: 244703449
the sum for k = 0 is: 244703449
the MULT for k = 1 is: 598585156
the sum for k = 1 is: 843288605
the MULT for k = 2 is: 46036225 // this is WRONG!!!
the sum for k = 2 is: 889324830
.
.
.
.
the MULT for k = 2 is: 189805729
the sum for k = 2 is: 1330739379
multmatrix
889324830 650114833 324678230
650114833 1504730698 -308929574
324678230 -308929574 1330739379
Correct result should be
multmatrix - correct
4.2950e+09 -2.2870e+03 1.2886e+04
-2.2870e+03 4.2950e+09 -1.2394e+05
1.2886e+04 -1.2394e+05 4.2951e+09
Why is the matrix multiplication wrong?
What should I change in the above code so that the multiplication of two matrices is overflow-proof?
(I am trying to write a program that multiplies two 32-bit numbers, to be ported to a system that has only 32-bit registers.)
So according to the answer below this actually works.
#define LOW_WORD(x) ((uint32_t)(x) & 0xffff)
#define HIGH_WORD(x) ((uint32_t)(x) >> 16)
#define ABS(x) (((x) >= 0) ? (x) : -(x))
#define SIGN(x) (((x) >= 0) ? 1 : -1)
#define UNSIGNED_MULT(a, b) \
(((LOW_WORD(a) * LOW_WORD(b)) << 0) + \
((int64_t)(LOW_WORD(a) * HIGH_WORD(b) + HIGH_WORD(a) * LOW_WORD(b)) << 16) + \
((int64_t)(HIGH_WORD((a)) * HIGH_WORD((b))) << 32))
#define MULT(a, b) (UNSIGNED_MULT(ABS((a)), ABS((b))) * SIGN((a)) * SIGN((b)))
Thank you for helping me understand some things! I'll try turning the whole thing to functions and posting it back.

This
(((x) << 16) >> 16)
doesn't produce an unsigned 16-bit number, as you might expect. The type of this expression is the same as the type of x, which is int32_t (a signed integer). Indeed, on any sensible (two's complement) C implementation, for x=58751:
x = 00000000000000001110010101111111
(x) << 16 = 11100101011111110000000000000000 (negative number)
(((x) << 16) >> 16) = 11111111111111111110010101111111 (negative number)
To extract the low 16 bits properly, use unsigned arithmetic:
((uint32_t)(x) & 0xffff)
or (preserving your style)
((uint32_t)(x) << 16 >> 16)
To get the high word, you have to use unsigned arithmetic too:
((uint32_t)(x) >> 16)
Also, the compiler might need help determining the range of this expression (to do optimizations):
(uint16_t)((uint32_t)(x) & 0xffff)
Some (all?) compilers are smart enough to do that by themselves though.
Also, as noted by doynax, the product of low word and high word is a 32-bit number (or 31-bit, but it doesn't matter). To shift it left by 16 bits, you have to cast it to a 64-bit type, just like you do it with the high words:
((int64_t)(LOW_WORD(a) * HIGH_WORD(b) + HIGH_WORD(a) * LOW_WORD(b)) << 16)
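The same decomposition can be written as functions instead of macros, which avoids the type surprises of macro arguments. This is a minimal sketch under the assumption that 64-bit integer types are available for the result; mul_u32, mul_s32 and check_mul are names invented here.

```c
#include <assert.h>
#include <stdint.h>

/* 32 x 32 -> 64 bit unsigned multiply built from 16 x 16 -> 32 bit products. */
uint64_t mul_u32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffffu, ah = a >> 16;
    uint32_t bl = b & 0xffffu, bh = b >> 16;
    uint32_t ll = al * bl;   /* each partial product fits in 32 bits */
    uint32_t lh = al * bh;
    uint32_t hl = ah * bl;
    uint32_t hh = ah * bh;
    /* the middle sum can need 33 bits, so widen before adding */
    return (uint64_t)ll + (((uint64_t)lh + hl) << 16) + ((uint64_t)hh << 32);
}

/* Signed version: multiply magnitudes, then restore the sign. */
int64_t mul_s32(int32_t a, int32_t b)
{
    /* unsigned negation is well defined even for INT32_MIN */
    uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;
    uint64_t p = mul_u32(ua, ub);
    return ((a < 0) != (b < 0)) ? -(int64_t)p : (int64_t)p;
}

/* Compare against the compiler's own 64-bit multiply on values from the question. */
void check_mul(void)
{
    static const int32_t v[] = { 15643, 24466, 58751, 54056, 26823, -25563,
                                 -33591, 54561, -13777 };
    for (int i = 0; i < 9; i++)
        for (int j = 0; j < 9; j++)
            assert(mul_s32(v[i], v[j]) == (int64_t)v[i] * v[j]);
}
```

For example, mul_s32(58751, 58751) yields 3451680001, the product that the original macros got wrong.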

How to interleave 2 booleans using bitwise operators?

Suppose I have two 4-bit values, ABCD and abcd. How can I interleave them, so the result becomes AaBbCcDd, using bitwise operators? Example in pseudo-C:
nibble a = 0b1001;
nibble b = 0b1100;
char c = foo(a,b);
print_bits(c);
// output: 0b11010010
Note: 4 bits is just for illustration, I want to do this with two 32bit ints.

This is called the perfect shuffle operation, and it's discussed at length in the Bible Of Bit Bashing, Hacker's Delight by Henry Warren, section 7-2 "Shuffling Bits."
Assuming x is a 32-bit integer with a in its high-order 16 bits and b in its low-order 16 bits:
unsigned int x = (a << 16) | b; /* put a and b in place */
the following straightforward C-like code accomplishes the perfect shuffle:
x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;
He also gives an alternative form which is faster on some CPUs, and (I think) a little more clear and extensible:
unsigned int t; /* an intermediate, temporary variable */
t = (x ^ (x >> 8)) & 0x0000FF00; x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F0; x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C; x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x22222222; x = x ^ t ^ (t << 1);
I see you have edited your question to ask for a 64-bit result from two 32-bit inputs. I'd have to think about how to extend Warren's technique. I think it wouldn't be too hard, but I'd have to give it some thought. If someone else wanted to start here and give a 64-bit version, I'd be happy to upvote them.
EDITED FOR 64 BITS
I extended the second solution to 64 bits in a straightforward way. First I doubled the length of each of the constants. Then I added a line at the beginning to swap adjacent double-bytes and intermix them. In the following 4 lines, which are pretty much the same as the 32-bit version, the first line swaps adjacent bytes and intermixes, the second line drops down to nibbles, the third line to double-bits, and the last line to single bits.
unsigned long long int t; /* an intermediate, temporary variable */
t = (x ^ (x >> 16)) & 0x00000000FFFF0000ull; x = x ^ t ^ (t << 16);
t = (x ^ (x >> 8)) & 0x0000FF000000FF00ull; x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F000F000F0ull; x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C0C0C0C0Cull; x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x2222222222222222ull; x = x ^ t ^ (t << 1);
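To make the 64-bit variant easier to try out, here is a sketch that wraps the five steps in a function taking the two 32-bit inputs, plus a bit-by-bit reference for comparison. The function names are hypothetical, and placing a in the high half of x is the same assumption the 32-bit version makes.

```c
#include <assert.h>
#include <stdint.h>

/* Bit i of a lands at bit 2*i + 1 of the result, bit i of b at bit 2*i. */
uint64_t interleave_shuffle(uint32_t a, uint32_t b)
{
    uint64_t x = ((uint64_t)a << 32) | b;   /* a in the high half, b in the low half */
    uint64_t t;
    t = (x ^ (x >> 16)) & 0x00000000FFFF0000ULL; x = x ^ t ^ (t << 16);
    t = (x ^ (x >> 8))  & 0x0000FF000000FF00ULL; x = x ^ t ^ (t << 8);
    t = (x ^ (x >> 4))  & 0x00F000F000F000F0ULL; x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 2))  & 0x0C0C0C0C0C0C0C0CULL; x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 1))  & 0x2222222222222222ULL; x = x ^ t ^ (t << 1);
    return x;
}

/* Bit-by-bit reference, used only for checking. */
uint64_t interleave_ref(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++) {
        r |= (uint64_t)((a >> i) & 1u) << (2 * i + 1);
        r |= (uint64_t)((b >> i) & 1u) << (2 * i);
    }
    return r;
}

void check_interleave_shuffle(void)
{
    static const uint32_t v[] = { 0u, 1u, 0x9u, 0xCu, 0x80000000u,
                                  0xDEADBEEFu, 0x12345678u, 0xFFFFFFFFu };
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            assert(interleave_shuffle(v[i], v[j]) == interleave_ref(v[i], v[j]));
}
```

With the nibble example from the question, interleave_shuffle(0x9, 0xC) gives 0xD2, i.e. 0b11010010.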

From Stanford "Bit Twiddling Hacks" page:
https://graphics.stanford.edu/~seander/bithacks.html#InterleaveTableObvious
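uint32_t x = /*...*/, y = /*...*/;
uint64_t z = 0;
for (unsigned i = 0; i < sizeof(x) * CHAR_BIT; i++) // unroll for more speed...
{
    // widen to 64 bits before shifting, otherwise the upper result bits are lost
    z |= (uint64_t)(x & 1U << i) << i | (uint64_t)(y & 1U << i) << (i + 1);
}
(CHAR_BIT comes from <limits.h>.) The page also proposes several different, faster algorithms that achieve the same result.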
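One of the faster variants on that page spreads each input with "binary magic numbers". The sketch below is my adaptation of that idea to 32-bit inputs and a 64-bit result, so the constants and the names spread_bits, interleave_magic and check_interleave_magic are assumptions rather than a copy of the page.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the 32 bits of v so that bit i moves to bit 2*i. */
uint64_t spread_bits(uint32_t v)
{
    uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}

/* x occupies the even bit positions and y the odd ones, as in the loop above. */
uint64_t interleave_magic(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

/* Compare with the bit-by-bit loop on a handful of values. */
void check_interleave_magic(void)
{
    static const uint32_t v[] = { 0u, 1u, 0x9u, 0xCu, 0x80000000u,
                                  0xDEADBEEFu, 0xFFFFFFFFu };
    for (int a = 0; a < 7; a++)
        for (int b = 0; b < 7; b++) {
            uint64_t z = 0;
            for (unsigned i = 0; i < 32; i++)
                z |= (uint64_t)(v[a] & 1U << i) << i
                   | (uint64_t)(v[b] & 1U << i) << (i + 1);
            assert(interleave_magic(v[a], v[b]) == z);
        }
}
```

Each step doubles the gap between groups of bits, so five steps are enough for 32 input bits.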

Like so:
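#include <limits.h>
typedef unsigned int half;
typedef unsigned long long full;
full mix_bits(half a, half b)
{
    full result = 0;
    for (unsigned i = 0; i < sizeof(half) * CHAR_BIT; i++)
        /* cast to the wide type before shifting; shifting a half by up to 63 bits would be undefined */
        result |= ((full)((a >> i) & 1) << (2 * i + 1)) | ((full)((b >> i) & 1) << (2 * i + 0));
    return result;
}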

Here is a loop-based solution that is hopefully more readable than some of the others already here.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
uint64_t interleave(uint32_t a, uint32_t b) {
uint64_t result = 0;
int i;
for (i = 0; i < 31; i++) {
result |= (a >> (31 - i)) & 1;
result <<= 1;
result |= (b >> (31 - i)) & 1;
result <<= 1;
}
// Skip the last left shift.
result |= (a >> (31 - i)) & 1;
result <<= 1;
result |= (b >> (31 - i)) & 1;
return result;
}
void printBits(uint64_t a) {
int i;
for (i = 0; i < 64; i++)
printf("%d", (int)((a >> (63 - i)) & 1));
puts("");
}
int main(){
uint32_t a = 0x9;
uint32_t b = 0x6;
uint64_t c = interleave(a,b);
printBits(a);
printBits(b);
printBits(c);
}

I have used the 2 tricks/operations used in this post How do you set, clear, and toggle a single bit? of setting a bit at particular index and checking the bit at particular index.
The following code is implemented using these 2 operations only.
unsigned int a = 0x9; // 0b1001
unsigned int b = 0xC; // 0b1100
unsigned long long c = 0;
int index; // To specify index of c
int i;
// Set bits in c from right to left.
// i starts at 31: a 32-bit input has bits 0..31, and shifting by 32 would be undefined.
for (i = 31; i >= 0; i--)
{
    index = 2 * i + 1; // We have to add the bit in c at this index
    // Check a
    if (a & (1u << i)) // Checking whether the i-th bit is set in a
        c |= 1ULL << index; // Setting bit in c at index (64-bit shift)
    index--;
    // Check b
    if (b & (1u << i)) // Checking whether the i-th bit is set in b
        c |= 1ULL << index; // Setting bit in c at index
}
printf("%llu", c);
Output: 210 which is 0b11010010

Is there a more efficient way of expanding a char to an uint64_t?

I want to inflate an unsigned char to an uint64_t by repeating each bit 8 times. E.g.
char -> uint64_t
0x00 -> 0x00
0x01 -> 0xFF
0x02 -> 0xFF00
0x03 -> 0xFFFF
0xAA -> 0xFF00FF00FF00FF00
I currently have the following implementation, using bit shifts to test if a bit is set, to accomplish this:
#include <stdint.h>
#include <inttypes.h>
#define BIT_SET(var, pos) ((var) & (1 << (pos)))
static uint64_t inflate(unsigned char a)
{
uint64_t MASK = 0xFF;
uint64_t result = 0;
for (int i = 0; i < 8; i++) {
if (BIT_SET(a, i))
result |= (MASK << (8 * i));
}
return result;
}
However, I'm fairly new to C, so this fiddling with individual bits makes me a little wary that there might be a better (i.e. more efficient) way of doing this.
EDIT TO ADD
Ok, so after trying out the table lookup solution, here are the results. However, keep in mind that I didn't test the routine directly, but rather as part of bigger function (a multiplication of binary matrices to be precise), so this might have affected how the results turned out. So, on my computer, when multiplying a million 8x8 matrices, and compiled with:
gcc -O2 -Wall -std=c99 foo.c
I got
./a.out original
real 0m0.127s
user 0m0.124s
sys 0m0.000s
./a.out table_lookup
real 0m0.012s
user 0m0.012s
sys 0m0.000s
So at least on my machine (a virtual machine 64 bit Linux Mint I should mention), the table lookup approach seems to provide a roughly 10-times speed-up, so I will accept that as the answer.

If you're looking for efficiency use a lookup table: a static array of 256 entries, each already holding the required result. You can use your code above to generate it.
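A minimal sketch of that idea: fill the 256-entry table once at start-up with the bit-testing loop from the question, then answer every later call with a single array read. The names inflate_table, init_inflate_table, inflate_lookup and check_inflate_table are made up for this example, and filling the table at run time (rather than pasting in generated constants) is just one possible choice.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t inflate_table[256];   /* 256 entries * 8 bytes = 2 KiB */

/* Fill the table once, using the bit-testing loop from the question. */
void init_inflate_table(void)
{
    for (unsigned v = 0; v < 256; v++) {
        uint64_t result = 0;
        for (int i = 0; i < 8; i++)
            if (v & (1u << i))
                result |= 0xFFULL << (8 * i);
        inflate_table[v] = result;
    }
}

/* Only valid after init_inflate_table() has been called. */
uint64_t inflate_lookup(unsigned char a)
{
    return inflate_table[a];
}

/* Spot-check against the examples listed in the question. */
void check_inflate_table(void)
{
    init_inflate_table();
    assert(inflate_lookup(0x00) == 0x00);
    assert(inflate_lookup(0x01) == 0xFF);
    assert(inflate_lookup(0x02) == 0xFF00);
    assert(inflate_lookup(0x03) == 0xFFFF);
    assert(inflate_lookup(0xAA) == 0xFF00FF00FF00FF00ULL);
}
```

Printing the entries instead of storing them would give the initializer list for a static const table.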

On selected architectures (SSE, NEON) there are fast vector operations that can speed up this task or are designed to do it. Without special instructions, the suggested lookup table approach is both the fastest and the most portable.
If the 2 KB table size is an issue, parallel vector arithmetic operations can be simulated:
static uint64_t inflate_parallel(unsigned char a) {
uint64_t vector = a * 0x0101010101010101ULL;
// replicate the word all over qword
// A5 becomes A5 A5 A5 A5 A5 A5 A5 A5
vector &= 0x8040201008040201; // becomes 80 00 20 00 00 04 00 01 <--
vector += 0x00406070787c7e7f; // becomes 80 40 80 70 78 80 7e 80
// MSB is correct
vector = (vector >> 7) & 0x0101010101010101ULL; // LSB is correct
return vector * 255; // all bits correct
}
EDIT: 2^31 iterations (unrolled four times to reduce loop overhead)
time ./parallel time ./original time ./lookup
real 0m2.038s real 0m14.161s real 0m1.436s
user 0m2.030s user 0m14.120s user 0m1.430s
sys 0m0.000s sys 0m0.000s sys 0m0.000s
That's about 7x speedup, while the lookup table gives ~10x speedup

You should profile what your code does, before worrying about optimising it.
On my compiler locally, your code gets entirely inlined, unrolled and turned into 8 constant test + or instructions when the value is unknown, and turned into a constant when the value is known at compile time. I could probably marginally improve it by removing a few branches, but the compiler is doing a reasonable job on its own.
Optimising the loop is then a bit pointless. A table lookup might be more efficient, but would probably prevent the compiler from making optimisations itself.

The desired functionality can be achieved by moving each bit of the source into the lsb of the appropriate target byte (0 → 0, 1 → 8, 2 → 16, ...., 7 → 56), then expanding each lsb to cover the whole byte, which is easily done by multiplying with 0xff (255). Instead of moving bits into place individually using shifts, then combining the results, we can use an integer multiply to shift multiple bits in parallel. To prevent self-overlap, we can move only the least-significant seven source bits in this fashion, but need to move the source msb separately with a shift.
This leads to the following ISO-C99 implementation:
#include <stdint.h>
/* expand each bit in input into one byte in output */
uint64_t fast_inflate (uint8_t a)
{
const uint64_t spread7 = (1ULL << 42) | (1ULL << 35) | (1ULL << 28) | (1ULL << 21) |
(1ULL << 14) | (1ULL << 7) | (1UL << 0);
const uint64_t byte_lsb = (1ULL << 56) | (1ULL << 48) | (1ULL << 40) | (1ULL << 32) |
(1ULL << 24) | (1ULL << 16) | (1ULL << 8) | (1ULL << 0);
uint64_t r;
/* spread bits to lsbs of each byte */
r = (((uint64_t)(a & 0x7f) * spread7) + ((uint64_t)a << 49));
/* extract the lsbs of all bytes */
r = r & byte_lsb;
/* fill each byte with its lsb */
r = r * 0xff;
return r;
}
#define BIT_SET(var, pos) ((var) & (1 << (pos)))
static uint64_t inflate(unsigned char a)
{
uint64_t MASK = 0xFF;
uint64_t result = 0;
for (int i = 0; i < 8; i++) {
if (BIT_SET(a, i))
result |= (MASK << (8 * i));
}
return result;
}
#include <stdio.h>
#include <stdlib.h>
int main (void)
{
uint8_t a = 0;
do {
uint64_t res = fast_inflate (a);
uint64_t ref = inflate (a);
if (res != ref) {
printf ("error # %02x: fast_inflate = %016llx inflate = %016llx\n",
a, (unsigned long long)res, (unsigned long long)ref);
return EXIT_FAILURE;
}
a++;
} while (a);
printf ("test passed\n");
return EXIT_SUCCESS;
}
Most x64 compilers will compile fast_inflate() in straightforward manner. For example, my Intel compiler Version 13.1.3.198, when building with /Ox, generates the 11-instruction sequence below. Note that the final multiply with 0xff is actually implemented as a shift and subtract sequence.
fast_inflate PROC
mov rdx, 040810204081H
movzx r9d, cl
and ecx, 127
mov r8, 0101010101010101H
imul rdx, rcx
shl r9, 49
add r9, rdx
and r9, r8
mov rax, r9
shl rax, 8
sub rax, r9
ret

If you're willing to spend 256 * 8 = 2kB of memory on this (i.e. become less efficient in terms of memory, but more efficient in terms of CPU cycles needed), the most efficient way would be to pre-compute a lookup table:
static uint64_t inflate(unsigned char a) {
static const uint64_t charToUInt64[256] = {
0x0000000000000000, 0x00000000000000FF, 0x000000000000FF00, 0x000000000000FFFF,
// ...
};
return charToUInt64[a];
}

Here is one more method using only simple arithmetic:
uint64_t inflate_chqrlie(uint8_t value) {
uint64_t x = value;
x = (x | (x << 28));
x = (x | (x << 14));
x = (x | (x << 7)) & 0x0101010101010101ULL;
x = (x << 8) - x;
return x;
}
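Another very efficient and concise one by phuclv using multiplication and mask. Note that, as written, it places bit i of the input in the least significant bit of byte 7 - i (0x01 rather than 0xFF, in reversed byte order), so it is not a drop-in replacement for inflate: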
static uint64_t inflate_phuclv(uint8_t b) {
uint64_t MAGIC = 0x8040201008040201ULL;
uint64_t MASK = 0x8080808080808080ULL;
return ((MAGIC * b) & MASK) >> 7;
}
And another with a small lookup table:
static uint32_t const lut_4_32[16] = {
0x00000000, 0x000000FF, 0x0000FF00, 0x0000FFFF,
0x00FF0000, 0x00FF00FF, 0x00FFFF00, 0x00FFFFFF,
0xFF000000, 0xFF0000FF, 0xFF00FF00, 0xFF00FFFF,
0xFFFF0000, 0xFFFF00FF, 0xFFFFFF00, 0xFFFFFFFF,
};
static uint64_t inflate_lut32(uint8_t b) {
return lut_4_32[b & 15] | ((uint64_t)lut_4_32[b >> 4] << 32);
}
I wrote a benchmarking program to determine the relative performance of the different approaches on my system (x86_64-apple-darwin16.7.0, Apple LLVM version 9.0.0 (clang-900.0.39.2), clang -O3).
The results show that my function inflate_chqrlie is faster than the naive approaches but slower than the other elaborate versions, all of which are beaten hands down by inflate_lut64, which uses a 2KB lookup table, in cache-optimal situations.
The function inflate_lut32, using a much smaller lookup table (64 bytes instead of 2KB) is not as fast as inflate_lut64, but seems a good compromise for 32-bit architectures as it is still much faster than all other alternatives.
64-bit benchmark:
inflate: 0, 848.316ms
inflate_Curd: 0, 845.424ms
inflate_chqrlie: 0, 371.502ms
fast_inflate_njuffa: 0, 288.669ms
inflate_parallel1: 0, 242.827ms
inflate_parallel2: 0, 315.105ms
inflate_parallel3: 0, 363.379ms
inflate_parallel4: 0, 304.051ms
inflate_parallel5: 0, 301.205ms
inflate_phuclv: 0, 109.130ms
inflate_lut32: 0, 197.178ms
inflate_lut64: 0, 25.160ms
32-bit benchmark:
inflate: 0, 1451.464ms
inflate_Curd: 0, 955.509ms
inflate_chqrlie: 0, 385.036ms
fast_inflate_njuffa: 0, 463.212ms
inflate_parallel1: 0, 468.070ms
inflate_parallel2: 0, 570.107ms
inflate_parallel3: 0, 511.741ms
inflate_parallel4: 0, 601.892ms
inflate_parallel5: 0, 506.695ms
inflate_phuclv: 0, 192.431ms
inflate_lut32: 0, 140.968ms
inflate_lut64: 0, 28.776ms
Here is the code:
#include <stdio.h>
#include <stdint.h>
#include <time.h>
static uint64_t inflate(unsigned char a) {
#define BIT_SET(var, pos) ((var) & (1 << (pos)))
uint64_t MASK = 0xFF;
uint64_t result = 0;
for (int i = 0; i < 8; i++) {
if (BIT_SET(a, i))
result |= (MASK << (8 * i));
}
return result;
}
static uint64_t inflate_Curd(unsigned char a) {
uint64_t mask = 0xFF;
uint64_t result = 0;
for (int i = 0; i < 8; i++) {
if (a & 1)
result |= mask;
mask <<= 8;
a >>= 1;
}
return result;
}
uint64_t inflate_chqrlie(uint8_t value) {
uint64_t x = value;
x = (x | (x << 28));
x = (x | (x << 14));
x = (x | (x << 7)) & 0x0101010101010101ULL;
x = (x << 8) - x;
return x;
}
uint64_t fast_inflate_njuffa(uint8_t a) {
const uint64_t spread7 = (1ULL << 42) | (1ULL << 35) | (1ULL << 28) | (1ULL << 21) |
(1ULL << 14) | (1ULL << 7) | (1UL << 0);
const uint64_t byte_lsb = (1ULL << 56) | (1ULL << 48) | (1ULL << 40) | (1ULL << 32) |
(1ULL << 24) | (1ULL << 16) | (1ULL << 8) | (1ULL << 0);
uint64_t r;
/* spread bits to lsbs of each byte */
r = (((uint64_t)(a & 0x7f) * spread7) + ((uint64_t)a << 49));
/* extract the lsbs of all bytes */
r = r & byte_lsb;
/* fill each byte with its lsb */
r = r * 0xff;
return r;
}
// Aki Suihkonen: 1.265
static uint64_t inflate_parallel1(unsigned char a) {
uint64_t vector = a * 0x0101010101010101ULL;
// replicate the word all over qword
// A5 becomes A5 A5 A5 A5 A5 A5 A5 A5
vector &= 0x8040201008040201; // becomes 80 00 20 00 00 04 00 01 <--
vector += 0x00406070787c7e7f; // becomes 80 40 80 70 78 80 7e 80
// MSB is correct
vector = (vector >> 7) & 0x0101010101010101ULL; // LSB is correct
return vector * 255; // all bits correct
}
// By seizet and then combine: 1.583
static uint64_t inflate_parallel2(unsigned char a) {
uint64_t vector1 = a * 0x0002000800200080ULL;
uint64_t vector2 = a * 0x0000040010004001ULL;
uint64_t vector = (vector1 & 0x0100010001000100ULL) | (vector2 & 0x0001000100010001ULL);
return vector * 255;
}
// Stay in 32 bits as much as possible: 1.006
static uint64_t inflate_parallel3(unsigned char a) {
uint32_t vector1 = (( (a & 0x0F) * 0x00204081) & 0x01010101) * 255;
uint32_t vector2 = ((((a & 0xF0) >> 4) * 0x00204081) & 0x01010101) * 255;
return (((uint64_t)vector2) << 32) | vector1;
}
// Do the common computation in 64 bits: 0.915
static uint64_t inflate_parallel4(unsigned char a) {
uint32_t vector1 = (a & 0x0F) * 0x00204081;
uint32_t vector2 = ((a & 0xF0) >> 4) * 0x00204081;
uint64_t vector = (vector1 | (((uint64_t)vector2) << 32)) & 0x0101010101010101ULL;
return vector * 255;
}
// Some computation is done in 64 bits a little sooner: 0.806
static uint64_t inflate_parallel5(unsigned char a) {
uint32_t vector1 = (a & 0x0F) * 0x00204081;
uint64_t vector2 = (a & 0xF0) * 0x002040810000000ULL;
uint64_t vector = (vector1 | vector2) & 0x0101010101010101ULL;
return vector * 255;
}
static uint64_t inflate_phuclv(uint8_t b) {
uint64_t MAGIC = 0x8040201008040201ULL;
uint64_t MASK = 0x8080808080808080ULL;
return ((MAGIC * b) & MASK) >> 7;
}
static uint32_t const lut_4_32[16] = {
0x00000000, 0x000000FF, 0x0000FF00, 0x0000FFFF,
0x00FF0000, 0x00FF00FF, 0x00FFFF00, 0x00FFFFFF,
0xFF000000, 0xFF0000FF, 0xFF00FF00, 0xFF00FFFF,
0xFFFF0000, 0xFFFF00FF, 0xFFFFFF00, 0xFFFFFFFF,
};
static uint64_t inflate_lut32(uint8_t b) {
return lut_4_32[b & 15] | ((uint64_t)lut_4_32[b >> 4] << 32);
}
static uint64_t lut_8_64[256];
static uint64_t inflate_lut64(uint8_t b) {
return lut_8_64[b];
}
#define ITER 1000000
int main() {
clock_t t;
uint64_t x;
for (int b = 0; b < 256; b++)
lut_8_64[b] = inflate((uint8_t)b);
#define TEST(func) do { \
t = clock(); \
x = 0; \
for (int i = 0; i < ITER; i++) { \
for (int b = 0; b < 256; b++) \
x ^= func((uint8_t)b); \
} \
t = clock() - t; \
printf("%20s: %llu, %.3fms\n", \
#func, (unsigned long long)x, t * 1000.0 / CLOCKS_PER_SEC); \
} while (0)
TEST(inflate);
TEST(inflate_Curd);
TEST(inflate_chqrlie);
TEST(fast_inflate_njuffa);
TEST(inflate_parallel1);
TEST(inflate_parallel2);
TEST(inflate_parallel3);
TEST(inflate_parallel4);
TEST(inflate_parallel5);
TEST(inflate_phuclv);
TEST(inflate_lut32);
TEST(inflate_lut64);
return 0;
}
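The benchmark only times the functions: the value it prints is the XOR of all 256 results, which comes out as 0 for every variant and so does not show whether the lanes are right. A minimal sketch of a correctness check follows; inflate_ref, count_mismatches and require_match are names made up for this illustration, and only two of the variants are copied in. Tracing b = 1 by hand suggests that inflate_phuclv as listed returns 0x0100000000000000 (0x01 per lane, lanes in reverse order) rather than 0x00000000000000FF, so it may need adapting before it can stand in for inflate.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*inflate_fn)(uint8_t);

/* Reference (hypothetical helper): byte i is 0xFF when bit i of a is set. */
static uint64_t inflate_ref(uint8_t a) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (a & (1u << i))
            result |= (uint64_t)0xFF << (8 * i);
    }
    return result;
}

/* Copied from the listing above. */
static uint64_t inflate_chqrlie(uint8_t value) {
    uint64_t x = value;
    x = (x | (x << 28));
    x = (x | (x << 14));
    x = (x | (x << 7)) & 0x0101010101010101ULL;
    x = (x << 8) - x;
    return x;
}

/* Copied from the listing above. */
static uint64_t inflate_phuclv(uint8_t b) {
    uint64_t MAGIC = 0x8040201008040201ULL;
    uint64_t MASK = 0x8080808080808080ULL;
    return ((MAGIC * b) & MASK) >> 7;
}

/* Number of inputs (out of 256) for which fn disagrees with the reference. */
static int count_mismatches(inflate_fn fn) {
    int bad = 0;
    for (int b = 0; b < 256; b++) {
        if (fn((uint8_t)b) != inflate_ref((uint8_t)b))
            bad++;
    }
    return bad;
}

/* Aborts (via assert) if fn differs from the reference anywhere. */
static void require_match(inflate_fn fn) {
    assert(count_mismatches(fn) == 0);
}
```

Calling require_match(inflate_chqrlie) before timing a variant is one way to make sure the numbers describe a function that computes the intended result.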

Variations on the same theme as @Aki's answer. Some of them perform better here, but that may depend on your compiler and target machine (they should suit a superscalar processor better than Aki's function, even though they do more work, because there are fewer data dependencies).
// Aki Suihkonen: 1.265
static uint64_t inflate_parallel1(unsigned char a) {
uint64_t vector = a * 0x0101010101010101ULL;
vector &= 0x8040201008040201;
vector += 0x00406070787c7e7f;
vector = (vector >> 7) & 0x0101010101010101ULL;
return vector * 255;
}
// By seizet and then combine: 1.583
static uint64_t inflate_parallel2(unsigned char a) {
uint64_t vector1 = a * 0x0002000800200080ULL;
uint64_t vector2 = a * 0x0000040010004001ULL;
uint64_t vector = (vector1 & 0x0100010001000100ULL) | (vector2 & 0x0001000100010001ULL);
return vector * 255;
}
// Stay in 32 bits as much as possible: 1.006
static uint64_t inflate_parallel3(unsigned char a) {
uint32_t vector1 = (( (a & 0x0F) * 0x00204081) & 0x01010101) * 255;
uint32_t vector2 = ((((a & 0xF0) >> 4) * 0x00204081) & 0x01010101) * 255;
return (((uint64_t)vector2) << 32) | vector1;
}
// Do the common computation in 64 bits: 0.915
static uint64_t inflate_parallel4(unsigned char a) {
uint32_t vector1 = (a & 0x0F) * 0x00204081;
uint32_t vector2 = ((a & 0xF0) >> 4) * 0x00204081;
uint64_t vector = (vector1 | (((uint64_t)vector2) << 32)) & 0x0101010101010101ULL;
return vector * 255;
}
// Some computation is done in 64 bits a little sooner: 0.806
static uint64_t inflate_parallel5(unsigned char a) {
uint32_t vector1 = (a & 0x0F) * 0x00204081;
uint64_t vector2 = (a & 0xF0) * 0x002040810000000ULL;
uint64_t vector = (vector1 | vector2) & 0x0101010101010101ULL;
return vector * 255;
}

Two minor optimizations:
One for testing the bits in the input (a will be destroyed but this doesn't matter)
The other for shifting the mask.
static uint64_t inflate(unsigned char a)
{
uint64_t mask = 0xFF;
uint64_t result = 0;
for (int i = 0; i < 8; i++) {
if (a & 1)
result |= mask;
mask <<= 8;
a >>= 1;
}
return result;
}
Maybe you can also replace the 'for (int i = 0; i < 8; i++)' loop with a 'while (a)' loop.
This only works if the right shift a >>= 1 shifts in zero bits.
Right-shifting a negative signed value is implementation-defined in C, and an arithmetic shift would cause an infinite loop in some cases.
Here a is an unsigned char, so its value is never negative and the loop terminates.
EDIT:
To see the result I compiled both variants with gcc -std=c99 -S source.c.
A quick glance at the resulting assembler output shows that the optimization shown above yields about one third fewer instructions, most of them inside the loop.
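The 'while (a)' idea could look roughly like the following sketch. inflate_while is a name invented here; the parameter is an unsigned char, so the right shift always brings in zero bits and the loop ends after at most eight passes.

```c
#include <stdint.h>

/* Sketch of the while-loop variant; stops as soon as no set bits remain. */
static uint64_t inflate_while(unsigned char a) {
    uint64_t mask = 0xFF;
    uint64_t result = 0;
    while (a) {
        if (a & 1)
            result |= mask;   /* fill the current lane */
        mask <<= 8;           /* move on to the next lane */
        a >>= 1;              /* a is unsigned, so zeros are shifted in */
    }
    return result;
}
```

For inputs whose high bits are clear, this version leaves the loop early, which the fixed eight-iteration loop cannot do.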

retrieve byte from 32 bit integer using bitwise operators

Here is the problem and what I currently have, I just don't understand how it is wrong...
getByte - Extract byte n from word x
Bytes numbered from 0 (LSB) to 3 (MSB)
Examples: getByte(0x12345678,1) = 0x56
Legal ops: ! ~ & ^ | + << >>
Max ops: 6
Rating: 2
int getByte(int x, int n) {
return ((x << (24 - 8 * n)) >> (8 * n));
}

Your shifting doesn't make any sense - first, you shift left by (24 - 8n) bits, then you shift back right by 8n bits. Why? Also, it's wrong. If n is 0, you shift x left by 24 bits and return that value. Try pen and paper to see that this is entirely wrong.
The correct approach would be to do:
int getByte(int x, int n) {
return (x >> 8*n) & 0xFF;
}
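The shift pair from the question can also be repaired instead of replaced: shift left so the wanted byte lands in the top byte, then shift right by a fixed 24. The sketch below uses uint32_t so the right shift is logical, and it uses - and *, so it illustrates the idea rather than satisfying the puzzle's operator rules. get_byte_shift_pair is a hypothetical name.

```c
#include <stdint.h>

/* Move byte n to the top byte, then bring it down to the bottom byte. */
static uint32_t get_byte_shift_pair(uint32_t x, int n) {
    return (x << (24 - 8 * n)) >> 24;
}
```

With a signed int the final right shift could sign-extend, which is one reason the masked version with & 0xFF is the more common choice.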

Unless I am totally mistaken, your code is mathematically incorrect.
getByte(0x000000ff, 0) {
24 - 8 * n = 24;
8 * n = 0;
0x000000ff << 24 = 0xff000000;
0xff000000 >> 0 = 0xff000000;
return 0xff000000; // should return 0xff
}
Not being allowed to use operators - and especially * is a problem (can't do * 8). I came up with this:
uint8_t getByte (uint32_t x, int n) {
switch (n) {
case 0:
return x & 0xff;
case 1:
return (x >> 8) & 0xff;
case 2:
return (x >> 16) & 0xff;
case 3:
return x >> 24;
}
}
Not exactly beautiful, but it conforms to the problem description: 6 operators, all of them legal.
EDIT: Just had a (pretty obvious) idea for how to avoid * 8
uint8_t getByte (uint32_t x, int n) {
return (x >> (n << 3)) & 0xff;
}

I don't understand how your function works. Try this instead:
int getByte(int x, int n)
{
return (x >> (8 * n)) & 0xFF;
}

Rounding up to next power of 2

I want to write a function that returns the nearest next power of 2 number. For example if my input is 789, the output should be 1024. Is there any way of achieving this without using any loops but just using some bitwise operators?

Check the Bit Twiddling Hacks. You need to get the base 2 logarithm, then add 1 to that. Example for a 32-bit value:
Round up to the next highest power of 2
unsigned int v; // compute the next highest power of 2 of 32-bit v
v--;
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v++;
The extension to other widths should be obvious.
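As a sketch of that extension, a 64-bit version adds one more step with a shift of 32. The function name is made up for this example. Note that an input of 0 yields 0, because the initial decrement wraps around, and inputs above 2^63 also yield 0, because the final increment overflows.

```c
#include <stdint.h>

/* Round v up to the next power of two (64-bit variant of the snippet above). */
static uint64_t next_pow2_u64(uint64_t v) {
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;   /* the extra step needed for 64 bits */
    v++;
    return v;
}
```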

next = pow(2, ceil(log(x)/log(2)));
This works by finding the power you'd have to raise 2 to in order to get x (take the log of the number and divide by the log of the desired base; see Wikipedia for more). Then round that up with ceil to get the nearest whole-number power.
This is a more general purpose (i.e. slower!) method than the bitwise methods linked elsewhere, but good to know the maths, eh?

I think this works, too:
int power = 1;
while(power < x)
power*=2;
And the answer is power.

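#include <limits.h>
unsigned long upper_power_of_two(unsigned long v)
{
v--;
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
#if ULONG_MAX > 0xFFFFFFFFUL
v |= v >> 32; /* needed when unsigned long is 64 bits wide */
#endif
v++;
return v;
}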

If you're using GCC, you might want to have a look at Optimizing the next_pow2() function by Lockless Inc. That page describes a way to use the built-in function __builtin_clz() (count leading zeros) and then, more directly, the x86 (ia32) assembler instruction bsr (bit scan reverse), just as described in the gamedev link in another answer. This code might be faster than the code in the previous answers.
By the way, if you're not going to use assembler instruction and 64bit data type, you can use this
/**
* return the smallest power of two value
* greater than or equal to x
*
* Input range: [2..2147483648]
* Output range: [2..2147483648]
*
*/
__attribute__ ((const))
static inline uint32_t p2(uint32_t x)
{
#if 0
assert(x > 1);
assert(x <= ((UINT32_MAX/2) + 1));
#endif
return 1U << (32 - __builtin_clz (x - 1));
}

One more: it uses a loop, but this is much faster than the math functions.
power of two "floor" option:
int power = 1;
while (x >>= 1) power <<= 1;
power of two "ceil" option:
int power = 2;
x--; // <<-- UPDATED
while (x >>= 1) power <<= 1;
UPDATE
As mentioned in the comments, there was a mistake in the ceil variant that made its result wrong.
Here are full functions:
unsigned power_floor(unsigned x) {
unsigned power = 1;
while (x >>= 1) power <<= 1;
return power;
}
unsigned power_ceil(unsigned x) {
if (x <= 1) return 1;
unsigned power = 2;
x--;
while (x >>= 1) power <<= 1;
return power;
}

In standard C++20 this is included in <bit>.
The answer is simply
#include <bit>
unsigned long upper_power_of_two(unsigned long v)
{
return std::bit_ceil(v);
}
NOTE:
The solution I gave is for C++, not C. I would have answered the C++ question instead, but it was closed as a duplicate of this one!

For any unsigned type, building on the Bit Twiddling Hacks:
#include <climits>
#include <type_traits>
template <typename UnsignedType>
UnsignedType round_up_to_power_of_2(UnsignedType v) {
static_assert(std::is_unsigned<UnsignedType>::value, "Only works for unsigned types");
v--;
for (size_t i = 1; i < sizeof(v) * CHAR_BIT; i *= 2) //Prefer size_t "Warning comparison between signed and unsigned integer"
{
v |= v >> i;
}
return ++v;
}
There isn't really a loop there as the compiler knows at compile time the number of iterations.

Although the question is tagged as C, here are my five cents. Luckily, C++20 was due to include std::ceil2 and std::floor2 (see here); in the published standard they were renamed std::bit_ceil and std::bit_floor. They are constexpr function templates; the current GCC implementation uses bit shifting and works with any unsigned integral type.

For IEEE floats you'd be able to do something like this.
#include <stdint.h>
#include <string.h>
int next_power_of_two(float a_F){
uint32_t f;
memcpy(&f, &a_F, sizeof f); // read the bits without violating strict aliasing
int b = (f << 9) != 0; // If we're a power of two this is 0, otherwise this is 1
int e = (int)(f >> 23); // remove fractional part of floating point number
e -= 127; // subtract 127 (the bias) from the exponent
// adds one to the exponent if we're not a power of two,
// then raises 2 to the power of our new exponent.
return (1 << (e + b));
}
If you need an integer solution and you're able to use inline assembly, BSR will give you the log2 of an integer on the x86. It returns the index of the highest set bit, which is exactly floor(log2) of that number (the result is undefined for an input of 0). Other processors often have similar instructions, such as CLZ, and depending on your compiler there might be an intrinsic available to do the work for you.
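With GCC or Clang, the __builtin_clz intrinsic gives the same information without inline assembly. A minimal sketch, assuming one of those compilers and a non-zero input (the builtin's result is undefined for 0); floor_log2_uint and next_pow2_uint are names chosen for this illustration.

```c
#include <assert.h>
#include <limits.h>

/* Index of the highest set bit, i.e. floor(log2(x)); x must be non-zero. */
static unsigned floor_log2_uint(unsigned x) {
    assert(x != 0);
    return (unsigned)(sizeof x * CHAR_BIT - 1) - (unsigned)__builtin_clz(x);
}

/* Smallest power of two >= x, for 1 <= x <= 2^(width-1). */
static unsigned next_pow2_uint(unsigned x) {
    assert(x != 0 && x <= (UINT_MAX >> 1) + 1);
    if (x == 1)
        return 1;
    /* Decrementing first keeps exact powers of two unchanged. */
    return 1u << (floor_log2_uint(x - 1) + 1);
}
```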

Here's my solution in C. Hope this helps!
int next_power_of_two(int n) {
int i = 0;
for (--n; n > 0; n >>= 1) {
i++;
}
return 1 << i;
}

In x86 you can use the sse4 bit manipulation instructions to make it fast.
//assume input is in eax
mov ecx,31
popcnt edx,eax //cycle 1
lzcnt eax,eax //cycle 2
sub ecx,eax
mov eax,1
cmp edx,1 //cycle 3
jle #done //cycle 4 - popcnt says its a power of 2, return input unchanged
shl eax,cl //cycle 5
#done: rep ret //cycle 5
In c you can use the matching intrinsics.
Or jumpless, which speeds up things by avoiding a misprediction due to a jump, but slows things down by lengthening the dependency chain. Time the code to see which works best for you.
//assume input is in eax
mov ecx,31
popcnt edx,eax //cycle 1
lzcnt eax,eax
sub ecx,eax
mov eax,1 //cycle 2
cmp edx,1
mov edx,0 //cycle 3
cmovle ecx,edx //cycle 4 - ensure eax does not change
shl eax,cl
#done: rep ret //cycle 5
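In C, the same idea can be sketched with the GCC/Clang builtins __builtin_popcount and __builtin_clz (an assumption about the toolchain; other compilers spell these differently). This version returns a power-of-two input unchanged and uses a shift count of 32 minus the leading-zero count for everything else; the function name is invented for the example.

```c
#include <stdint.h>

/* Smallest power of two >= x. Returns x for 0, and 0 when the
   result would not fit in 32 bits. */
static uint32_t next_pow2_popcnt(uint32_t x) {
    if (__builtin_popcount(x) <= 1)
        return x;               /* zero, or already a power of two */
    if (x > 0x80000000u)
        return 0;               /* the next power would be 2^32 */
    return (uint32_t)1 << (32 - __builtin_clz(x));
}
```

The early return also keeps __builtin_clz away from an input of 0, for which its result is undefined.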

/*
** http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
*/
#define __LOG2A(s) ((s &0xffffffff00000000) ? (32 +__LOG2B(s >>32)): (__LOG2B(s)))
#define __LOG2B(s) ((s &0xffff0000) ? (16 +__LOG2C(s >>16)): (__LOG2C(s)))
#define __LOG2C(s) ((s &0xff00) ? (8 +__LOG2D(s >>8)) : (__LOG2D(s)))
#define __LOG2D(s) ((s &0xf0) ? (4 +__LOG2E(s >>4)) : (__LOG2E(s)))
#define __LOG2E(s) ((s &0xc) ? (2 +__LOG2F(s >>2)) : (__LOG2F(s)))
#define __LOG2F(s) ((s &0x2) ? (1) : (0))
#define LOG2_UINT64 __LOG2A
#define LOG2_UINT32 __LOG2B
#define LOG2_UINT16 __LOG2C
#define LOG2_UINT8 __LOG2D
static inline uint64_t
next_power_of_2(uint64_t i)
{
#if defined(__GNUC__)
return 1ULL <<(1 +(63 -__builtin_clzll(i -1)));
#else
i =i -1;
i =LOG2_UINT64(i);
return 1ULL <<(1 +i);
#endif
}
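If you do not want to venture into the realm of undefined behaviour, the input value must be between 2 and 2^63 (an input of 1 makes the GCC branch call its count-leading-zeros builtin on 0, which is undefined). The macro is also useful for setting constants at compile time.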

For completeness here is a floating-point implementation in bog-standard C.
double next_power_of_two(double value) {
int exp;
if(frexp(value, &exp) == 0.5) {
// Omit this case to round precise powers of two up to the *next* power
return value;
}
return ldexp(1.0, exp);
}

An efficient Microsoft (e.g., Visual Studio 2017) specific solution in C / C++ for integer input. Handles the case of the input exactly matching a power of two value by decrementing before checking the location of the most significant 1 bit.
inline unsigned int ExpandToPowerOf2(unsigned int Value)
{
unsigned long Index;
_BitScanReverse(&Index, Value - 1);
return (1U << (Index + 1));
}
// - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#if defined(_WIN64) // The _BitScanReverse64 intrinsic is only available for 64 bit builds because it depends on x64
inline unsigned long long ExpandToPowerOf2(unsigned long long Value)
{
unsigned long Index;
_BitScanReverse64(&Index, Value - 1);
return (1ULL << (Index + 1));
}
#endif
This generates 5 or so inlined instructions for an Intel processor similar to the following:
dec eax
bsr rcx, rax
inc ecx
mov eax, 1
shl rax, cl
Apparently the Visual Studio C++ compiler isn't coded to optimize this for compile-time values, but it's not like there are a whole lot of instructions there.
Edit:
If you want an input value of 1 to yield 1 (2 to the zeroth power), a small modification to the above code still generates straight through instructions with no branch.
inline unsigned int ExpandToPowerOf2(unsigned int Value)
{
unsigned long Index;
_BitScanReverse(&Index, --Value);
if (Value == 0)
Index = (unsigned long) -1;
return (1U << (Index + 1));
}
Generates just a few more instructions. The trick is that Index can be replaced by a test followed by a cmove instruction.

Trying to make an "ultimate" solution for this. The following code
is targeted for C language (not C++),
uses compiler built-ins to yield efficient code (CLZ or BSR instruction) if compiler supports any,
is portable (standard C and no assembly) with the exception of built-ins, and
addresses all undefined behaviors.
If you're writing in C++, you may adjust the code appropriately. Note that C++20 introduces std::bit_ceil which does the exact same thing except the behavior may be undefined on certain conditions.
#include <limits.h>
#ifdef _MSC_VER
# if _MSC_VER >= 1400
/* _BitScanReverse is introduced in Visual C++ 2005 and requires
<intrin.h> (also introduced in Visual C++ 2005). */
#include <intrin.h>
#pragma intrinsic(_BitScanReverse)
#pragma intrinsic(_BitScanReverse64)
# define HAVE_BITSCANREVERSE 1
# endif
#endif
/* Macro indicating that the compiler supports __builtin_clz().
The name HAVE_BUILTIN_CLZ seems to be the most common, but in some
projects HAVE__BUILTIN_CLZ is used instead. */
#ifdef __has_builtin
# if __has_builtin(__builtin_clz)
# define HAVE_BUILTIN_CLZ 1
# endif
#elif defined(__GNUC__)
# if (__GNUC__ > 3)
# define HAVE_BUILTIN_CLZ 1
# elif defined(__GNUC_MINOR__)
# if (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)
# define HAVE_BUILTIN_CLZ 1
# endif
# endif
#endif
/**
* Returns the smallest power of two that is not smaller than x.
*/
unsigned long int next_power_of_2_long(unsigned long int x)
{
if (x <= 1) {
return 1;
}
x--;
#ifdef HAVE_BITSCANREVERSE
if (x > (ULONG_MAX >> 1)) {
return 0;
} else {
unsigned long int index;
(void) _BitScanReverse(&index, x);
return (1UL << (index + 1));
}
#elif defined(HAVE_BUILTIN_CLZ)
if (x > (ULONG_MAX >> 1)) {
return 0;
}
return (1UL << (sizeof(x) * CHAR_BIT - __builtin_clzl(x)));
#else
/* Solution from "Bit Twiddling Hacks"
<http://www.graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2>
but converted to a loop for smaller code size.
("gcc -O3" will unroll this.) */
{
unsigned int shift;
for (shift = 1; shift < sizeof(x) * CHAR_BIT; shift <<= 1) {
x |= (x >> shift);
}
}
return (x + 1);
#endif
}
unsigned int next_power_of_2(unsigned int x)
{
if (x <= 1) {
return 1;
}
x--;
#ifdef HAVE_BITSCANREVERSE
if (x > (UINT_MAX >> 1)) {
return 0;
} else {
unsigned long int index;
(void) _BitScanReverse(&index, x);
return (1U << (index + 1));
}
#elif defined(HAVE_BUILTIN_CLZ)
if (x > (UINT_MAX >> 1)) {
return 0;
}
return (1U << (sizeof(x) * CHAR_BIT - __builtin_clz(x)));
#else
{
unsigned int shift;
for (shift = 1; shift < sizeof(x) * CHAR_BIT; shift <<= 1) {
x |= (x >> shift);
}
}
return (x + 1);
#endif
}
unsigned long long next_power_of_2_long_long(unsigned long long x)
{
if (x <= 1) {
return 1;
}
x--;
#if (defined(HAVE_BITSCANREVERSE) && \
ULLONG_MAX == 18446744073709551615ULL)
if (x > (ULLONG_MAX >> 1)) {
return 0;
} else {
/* assert(sizeof(__int64) == sizeof(long long)); */
unsigned long int index;
(void) _BitScanReverse64(&index, x);
return (1ULL << (index + 1));
}
#elif defined(HAVE_BUILTIN_CLZ)
if (x > (ULLONG_MAX >> 1)) {
return 0;
}
return (1ULL << (sizeof(x) * CHAR_BIT - __builtin_clzll(x)));
#else
{
unsigned int shift;
for (shift = 1; shift < sizeof(x) * CHAR_BIT; shift <<= 1) {
x |= (x >> shift);
}
}
return (x + 1);
#endif
}

Portable solution in C#:
int GetNextPowerOfTwo(int input) {
return 1 << (int)Math.Ceiling(Math.Log2(input));
}
Math.Ceiling(Math.Log2(value)) calculates the exponent of the next power of two, the 1 << calculates the real value through bitshifting.
Faster solution if you have .NET Core 3 or above:
uint GetNextPowerOfTwoFaster(uint input) {
return (uint)1 << (sizeof(uint) * 8 - System.Numerics.BitOperations.LeadingZeroCount(input - 1));
}
This uses System.Numerics.BitOperations.LeadingZeroCount() which uses a hardware instruction if available:
https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/shared/System/Numerics/BitOperations.cs
Update:
RoundUpToPowerOf2() is Coming in .NET 6! The internal implementation is mostly the same as the .NET Core 3 solution above.
Here's the community update.

constexpr version of clp2 for C++14
#include <iostream>
#include <type_traits>
// Closest least power of 2 minus 1. Returns 0 if n = 0.
template <typename UInt, std::enable_if_t<std::is_unsigned<UInt>::value,int> = 0>
constexpr UInt clp2m1(UInt n, unsigned i = 1) noexcept
{ return i < sizeof(UInt) * 8 ? clp2m1(UInt(n | (n >> i)),i << 1) : n; }
/// Closest least power of 2 minus 1. Returns 0 if n <= 0.
template <typename Int, std::enable_if_t<std::is_integral<Int>::value && std::is_signed<Int>::value,int> = 0>
constexpr auto clp2m1(Int n) noexcept
{ return clp2m1(std::make_unsigned_t<Int>(n <= 0 ? 0 : n)); }
/// Closest least power of 2. Returns 2^N: 2^(N-1) < n <= 2^N. Returns 0 if n <= 0.
template <typename Int, std::enable_if_t<std::is_integral<Int>::value,int> = 0>
constexpr auto clp2(Int n) noexcept
{ return clp2m1(std::make_unsigned_t<Int>(n-1)) + 1; }
/// Next power of 2. Returns 2^N: 2^(N-1) <= n < 2^N. Returns 1 if n = 0. Returns 0 if n < 0.
template <typename Int, std::enable_if_t<std::is_integral<Int>::value,int> = 0>
constexpr auto np2(Int n) noexcept
{ return clp2m1(std::make_unsigned_t<Int>(n)) + 1; }
template <typename T>
void test(T v) { std::cout << clp2(v) << std::endl; }
int main()
{
test(-5); // 0
test(0); // 0
test(8); // 8
test(31); // 32
test(33); // 64
test(789); // 1024
test(char(260)); // 4
test(unsigned(-1) - 1); // 0
test<long long>(unsigned(-1) - 1); // 4294967296
return 0;
}

Many processor architectures support log base 2 or very similar operation – count leading zeros. Many compilers have intrinsics for it. See https://en.wikipedia.org/wiki/Find_first_set

Assuming you have a good compiler that can do the bit twiddling at compile time (the details are beyond me at this point), this works:
// http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
#define SH1(v) ((v-1) | ((v-1) >> 1)) // accidentally came up w/ this...
#define SH2(v) ((v) | ((v) >> 2))
#define SH4(v) ((v) | ((v) >> 4))
#define SH8(v) ((v) | ((v) >> 8))
#define SH16(v) ((v) | ((v) >> 16))
#define OP(v) (SH16(SH8(SH4(SH2(SH1(v))))))
#define CB0(v) ((v) - (((v) >> 1) & 0x55555555))
#define CB1(v) (((v) & 0x33333333) + (((v) >> 2) & 0x33333333))
#define CB2(v) ((((v) + ((v) >> 4) & 0xF0F0F0F) * 0x1010101) >> 24)
#define CBSET(v) (CB2(CB1(CB0((v)))))
#define FLOG2(v) (CBSET(OP(v)))
Test code below:
#include <iostream>
using namespace std;
// http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
#define SH1(v) ((v-1) | ((v-1) >> 1)) // accidentally guessed this...
#define SH2(v) ((v) | ((v) >> 2))
#define SH4(v) ((v) | ((v) >> 4))
#define SH8(v) ((v) | ((v) >> 8))
#define SH16(v) ((v) | ((v) >> 16))
#define OP(v) (SH16(SH8(SH4(SH2(SH1(v))))))
#define CB0(v) ((v) - (((v) >> 1) & 0x55555555))
#define CB1(v) (((v) & 0x33333333) + (((v) >> 2) & 0x33333333))
#define CB2(v) ((((v) + ((v) >> 4) & 0xF0F0F0F) * 0x1010101) >> 24)
#define CBSET(v) (CB2(CB1(CB0((v)))))
#define FLOG2(v) (CBSET(OP(v)))
#define SZ4 FLOG2(4)
#define SZ6 FLOG2(6)
#define SZ7 FLOG2(7)
#define SZ8 FLOG2(8)
#define SZ9 FLOG2(9)
#define SZ16 FLOG2(16)
#define SZ17 FLOG2(17)
#define SZ127 FLOG2(127)
#define SZ1023 FLOG2(1023)
#define SZ1024 FLOG2(1024)
#define SZ2_17 FLOG2((1ul << 17)) //
#define SZ_LOG2 FLOG2(SZ)
#define DBG_PRINT(x) do { std::printf("Line:%-4d" " %10s = %-10d\n", __LINE__, #x, x); } while(0);
uint32_t arrTble[FLOG2(63)];
int main(){
int8_t n;
DBG_PRINT(SZ4);
DBG_PRINT(SZ6);
DBG_PRINT(SZ7);
DBG_PRINT(SZ8);
DBG_PRINT(SZ9);
DBG_PRINT(SZ16);
DBG_PRINT(SZ17);
DBG_PRINT(SZ127);
DBG_PRINT(SZ1023);
DBG_PRINT(SZ1024);
DBG_PRINT(SZ2_17);
return(0);
}
Outputs:
Line:39 SZ4 = 2
Line:40 SZ6 = 3
Line:41 SZ7 = 3
Line:42 SZ8 = 3
Line:43 SZ9 = 4
Line:44 SZ16 = 4
Line:45 SZ17 = 5
Line:46 SZ127 = 7
Line:47 SZ1023 = 10
Line:48 SZ1024 = 10
Line:49 SZ2_17 = 17

I was trying to get the nearest lower power of 2 and wrote this function; I hope it helps. I just multiply the nearest lower power by 2 to get the nearest upper power of 2.
int nearest_upper_power(int number){
// Clear the lowest set bit until only the highest one remains
while((number&(number-1))!=0){
number&=number-1;
}
//Here number is closest lower power
number*=2;
return number;
}

Adapted Paul Dixon's answer to Excel, this works perfectly.
=POWER(2,CEILING.MATH(LOG(A1)/LOG(2)))

A variant of @YannDroneaud's answer that is also valid for x == 1; x86 platforms only, with GCC or Clang:
__attribute__ ((const))
static inline uint32_t p2(uint32_t x)
{
#if 0
assert(x > 0);
assert(x <= ((UINT32_MAX/2) + 1));
#endif
int clz;
uint32_t xm1 = x-1;
asm(
"lzcnt %1,%0"
:"=r" (clz)
:"rm" (xm1)
:"cc"
);
return 1U << (32 - clz);
}

Here is what I'm using to have this be a constant expression, if the input is a constant expression.
#define uptopow2_0(v) ((v) - 1)
#define uptopow2_1(v) (uptopow2_0(v) | uptopow2_0(v) >> 1)
#define uptopow2_2(v) (uptopow2_1(v) | uptopow2_1(v) >> 2)
#define uptopow2_3(v) (uptopow2_2(v) | uptopow2_2(v) >> 4)
#define uptopow2_4(v) (uptopow2_3(v) | uptopow2_3(v) >> 8)
#define uptopow2_5(v) (uptopow2_4(v) | uptopow2_4(v) >> 16)
#define uptopow2(v) (uptopow2_5(v) + 1) /* this is the one programmer uses */
So for instance, an expression like:
uptopow2(sizeof (struct foo))
will nicely reduce to a constant.
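As a sketch of what that buys you, the result can be used wherever C requires an integer constant expression, such as an enumerator or a C11 _Static_assert. struct foo with its 37-byte payload, FOO_SLOT_SIZE and foo_slot_size are made up for illustration; the shift chain stops at 16, so the macros cover values that fit in 32 bits.

```c
#include <stddef.h>

#define uptopow2_0(v) ((v) - 1)
#define uptopow2_1(v) (uptopow2_0(v) | uptopow2_0(v) >> 1)
#define uptopow2_2(v) (uptopow2_1(v) | uptopow2_1(v) >> 2)
#define uptopow2_3(v) (uptopow2_2(v) | uptopow2_2(v) >> 4)
#define uptopow2_4(v) (uptopow2_3(v) | uptopow2_3(v) >> 8)
#define uptopow2_5(v) (uptopow2_4(v) | uptopow2_4(v) >> 16)
#define uptopow2(v) (uptopow2_5(v) + 1)

struct foo { char payload[37]; };   /* hypothetical type, sizeof == 37 */

/* Evaluated entirely at compile time: an enumerator needs a constant. */
enum { FOO_SLOT_SIZE = uptopow2(sizeof (struct foo)) };

/* Compile-time check that exact powers of two are kept as they are. */
_Static_assert(uptopow2(64u) == 64u, "exact powers of two are kept");

static size_t foo_slot_size(void) {
    return FOO_SLOT_SIZE;
}
```

Because the argument is expanded many times, it should be free of side effects; something like uptopow2(i++) would not behave as intended.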

The g++ compiler provides a builtin function __builtin_clz that counts leading zeros:
So we could do:
int nextPowerOfTwo(unsigned int x) {
return 1 << sizeof(x)*8 - __builtin_clz(x);
}
int main () {
std::cout << nextPowerOfTwo(7) << std::endl;
std::cout << nextPowerOfTwo(31) << std::endl;
std::cout << nextPowerOfTwo(33) << std::endl;
std::cout << nextPowerOfTwo(8) << std::endl;
std::cout << nextPowerOfTwo(91) << std::endl;
return 0;
}
Results:
8
32
64
16
128
But note that for x == 0 the result of __builtin_clz is undefined.

If you need it for OpenGL related stuff:
/* Compute the nearest power of 2 number that is
* less than or equal to the value passed in.
*/
static GLuint
nearestPower( GLuint value )
{
int i = 1;
if (value == 0) return -1; /* Error! */
for (;;) {
if (value == 1) return i;
else if (value == 3) return i*4;
value >>= 1; i *= 2;
}
}

Convert it to a float and then use .hex() which shows the normalized IEEE representation.
>>> float(789).hex()
'0x1.8a80000000000p+9'
Then just extract the exponent and add 1.
>>> int(float(789).hex().split('p+')[1]) + 1
10
And raise 2 to this power.
>>> 2 ** (int(float(789).hex().split('p+')[1]) + 1)
1024

from math import ceil, log2
pot_ceil = lambda N: 0x1 << ceil(log2(N))
Test:
for i in range(1, 11):
print(i, pot_ceil(i))
Output:
1 1
2 2
3 4
4 4
5 8
6 8
7 8
8 8
9 16
10 16

import sys
def is_power2(x):
return x > 0 and ((x & (x - 1)) == 0)
def find_nearest_power2(x):
if x <= 0:
raise ValueError("invalid input")
if is_power2(x):
return x
else:
bits = get_bits(x)
upper = 1 << (bits)
lower = 1 << (bits - 1)
mid = (upper + lower) // 2
if (x - mid) > 0:
return upper
else:
return lower
def get_bits(x):
"""return number of bits in binary representation"""
if x < 0:
raise ValueError("invalid input: input should be positive integer")
count = 0
while (x != 0):
try:
x = x >> 1
except TypeError as error:
print(error, "input should be of type integer")
sys.exit(1)
count += 1
return count
