Efficient computation of greatest power of 2 < x [duplicate] - c

This question already has answers here:
Find most significant bit (left-most) that is set in a bit array
(17 answers)
Compute fast log base 2 ceiling
(15 answers)
Closed 9 years ago.
I have a requirement to compute the greatest power of 2 which is < an integer value, x
currently I am using:
#define log2(x) log(x)/log(2)
#define round(x) (int)(x+0.5)
x = round(pow(2,(ceil(log2(n))-1)));
this is in a performance critical function
Is there a more computationally efficient way of calculating x?

You are essentially looking for the highest non-zero bit in your number. Many processors have built-in instructions for this, which in turn are exposed by many compilers. For example, in GCC I would look at __builtin_clz, which
Returns the number of leading 0-bits in x, starting at the most significant bit position.
Together with sizeof(int) * CHAR_BIT and a shift, you can use this to figure out the corresponding pure-power-of-two integer. There's also a version for long integers.
(The CPU instruction is presumably called "CLZ" (count leading zeros), in case you need to look this up for other compilers.)
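For instance, a minimal sketch of that idea (GCC/Clang; the function name and the x >= 2 precondition are my own assumptions):
#include <limits.h>

static inline unsigned int greatest_pow2_below(unsigned int x)
{
    /* __builtin_clz is undefined for 0, and no power of two lies below 1,
       so this expects x >= 2. The x - 1 makes exact powers of two map to
       the next power down, as the question asks. */
    return 1u << (sizeof(unsigned int) * CHAR_BIT - 1 - __builtin_clz(x - 1));
}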

I have an integer log2 function in my c-libutl library (hosted on googlecode if anyone is interested)
/*
** Integer log base 2 of a 32-bit integer value.
** llog2(0) == llog2(1) == 0
*/
unsigned short llog2(unsigned long x)
{
  long l = 0;

  x &= 0xFFFFFFFF;   /* just in case 'long' is more than 32 bit */
  if (x == 0) return 0;

#ifndef UTL_NOASM
#if defined(__POCC__) || defined(_MSC_VER) || defined(__WATCOMC__)
  /* Pelles C, MS Visual C++, OpenWatcom */
  __asm { mov eax, [x]
          bsr ecx, eax
          mov l, ecx
  }
#elif defined(__GNUC__)
  l = (unsigned short) ((sizeof(long)*8 - 1) - __builtin_clzl(x));
#else
#define UTL_NOASM
#endif
#endif

#ifdef UTL_NOASM   /* Make a binary search. */
  if (x & 0xFFFF0000) {l += 16; x >>= 16;}  /* 11111111111111110000000000000000 */
  if (x & 0xFF00)     {l += 8;  x >>= 8; }  /* 1111111100000000 */
  if (x & 0xF0)       {l += 4;  x >>= 4; }  /* 11110000 */
  if (x & 0xC)        {l += 2;  x >>= 2; }  /* 1100 */
  if (x & 2)          {l += 1;           }  /* 10 */
  return (unsigned short)l;
#endif

  return (unsigned short)l;
}
Then you can simply compute
(1 << llog2(x))
to compute the greatest power of two that is no greater than x (pass x - 1 if you need one that is strictly less than x). Beware 0! You should handle it separately.
It uses assembler code but can also be forced to plain C code by defining the UTL_NOASM symbol.
The code has been tested at the time but it's quite some time I don't use it and I can't say if it behaves in a 64-bit environment.
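For instance, a small wrapper (my own naming) handling the caveats above, i.e. zero and exact powers of two:
unsigned long pow2_below(unsigned long x)
{
    if (x <= 1) return 0;        /* no power of two lies below 1 */
    return 1UL << llog2(x - 1);  /* x - 1 handles exact powers of two */
}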

Based on Bit Twiddling Hacks: Find the log base 2 of an N-bit integer in O(lg(N)) operations by Sean Eron Anderson (code contributed by Eric Cole and Andrew Shapira):
unsigned int highest_bit (uint32_t v) {
unsigned int r = 0, s;
s = (v > 0xFFFF) << 4; v >>= s; r |= s;
s = (v > 0xFF ) << 3; v >>= s; r |= s;
s = (v > 0xF ) << 2; v >>= s; r |= s;
s = (v > 0x3 ) << 1; v >>= s; r |= s;
return r | (v >> 1);
}
This returns the index of the highest bit of the input; the greatest power of 2 no greater than the input is then 1 << highest_bit(x), and the greatest power of 2 strictly less than the input is thus simply 1 << highest_bit(x-1).
For 64-bit inputs, just change the input type to uint64_t and add the following extra line at the beginning of the function, after the variable declarations:
s = (v > 0xFFFFFFFF) << 8; v >>= s; r |= s;

Left and right shift operators do this best:
int MaxPowerOf2(int x)
{
    int out = 1;
    while (x > 1) { x >>= 1; out <<= 1; }
    return out;
}

#include <math.h>
double greatestPower( double x )
{
return floor(log( x ) / log( 2 ));
}
That is true since log is a monotonically increasing function.
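If you then need the power itself rather than the exponent, a minimal follow-up sketch (my own function name) scales it back with ldexp:
#include <math.h>

double greatestPowerValue( double x )
{
    /* 2 raised to floor(log2(x)): the greatest power of two not exceeding x.
       Pass x - 1 (for integers) if a strictly smaller power is required. */
    return ldexp(1.0, (int)floor(log2(x)));
}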

Shifting bits around will most likely be much faster. Probably some bisection method on bits could make it even faster. Nice exercise for an improvement.
#include <stdio.h>
int closestPow2(int x)
{
int p;
if (x <= 1) return 0; /* No such power exists */
x--; /* Account for exact powers of 2, then one power less must be returned */
for (p = 0; x > 0; p++)
{
x >>= 1;
}
return 1<<(p-1);
}
int main(void)
{
printf("%x\n", closestPow2(0x7FFFFFFF));
return 0;
}

Related

How to correctly implement multiply for floating point numbers (software FP)

My program is about a method which is given floats and in this method I want to multiply or add those floats. But not multiply like a * b, I want to break those floats down to their structure like the bit for the sign, the 8 bit for the exponent and the rest of the bits as the mantissa.
I want to implement / emulate software floating-point add and multiply (to learn more about what FP hardware has to do).
In the head of the program there are the breakdowns:
#define SIGN(x) (x>>31);
#define MANT(x) (x&0x7FFFFF);
#define EXPO(x) ((x>>23)&0xFF);
#define SPLIT(x, s, m, e) do { \
s = SIGN(x); \
m = MANT(x); \
e = EXPO(x); \
if ( e != 0x00 && e != 0xFF ) { \
m |= 0x800000; \
} \
} while ( 0 )
#define BUILD(x, s, m, e) do { \
x = (s << 31) | (e<<23) | (m&0x7FFFFF); \
} while ( 0 )
The main looks as follows:
float f = 2.3;
float g = 1.8;
float h = foo(&f, &g);
And the method for the calculation looks like:
float foo(float *a, float *b) {
uint32_t ia = *(unsigned int *)a;
uint32_t ib = *(unsigned int *)b;
uint32_t result = 0;
uint32_t signa, signb, signr;
uint32_t manta, mantb, mantr;
uint32_t expoa, expob, expor;
SPLIT(ia, signa, manta, expoa);
SPLIT(ib, signb, mantb, expob);
I already tried the multiply by adding the exponents and multiply their mantissas as follow:
expor = (expoa -127) + (expob -127) + 127;
mantr = (manta) * (mantb);
signr = signa ^ signb;
The return and rebuild of the new float:
BUILD(result, signr, mantr, expor);
return *(float *)&result;
The problem now is that the result is wrong. mantr even ends up as a large negative number (if foo gets 1.5 and 2.4, mantr becomes -838860800 and the result is 2.0000000).
You can't just truncate the result of the mantissa multiply, you need to take the top 24 bits (after using the low half for rounding) and renormalize (adjust the exponent).
Floating point operations keep the top significand bits. The most significant part of the integer product is the high bits; the low bits are further places after the decimal. (Terminology: it's a "binary point", not "decimal point", because binary floats use radix 2 (binary), not 10 (decimal).)
For normalized inputs, the implicit leading 1 in the input significands means the 32x32 => 64-bit uint64_t product that you use to implement 24 x 24 => 48-bit mantissa multiplication will have its high bit in one of 2 possible locations, so you don't need a bit-scan to find it. A compare or single-bit-test will do.
For subnormal inputs, that's not guaranteed so you need to check where the MSB is, e.g. with GNU C __builtin_clzll. (There are many special cases to handle for one or both inputs being subnormal, and/or the output being subnormal.)
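As a rough sketch of that clz-based normalization for a binary32 significand (GNU C; the helper name and its in/out parameters are my own):
static void normalize_subnormal(unsigned int *mant, int *expo)
{
    /* Shift a nonzero subnormal significand so its MSB lands on the
       implicit-1 position (bit 23 for binary32), adjusting the exponent. */
    int shift = __builtin_clz(*mant) - 8;   /* 8 = 31 - 23 */
    *mant <<= shift;
    *expo -= shift;
}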
See https://en.wikipedia.org/wiki/Single-precision_floating-point_format for more about the IEEE-754 binary32 format, including the implied leading 1 of the significand.
And see #njuffa's answer for an actual tested + working implementation that does 64-bit operations as two 32-bit halves for some reason, instead of letting C do that efficiently.
Also, return *(float *)&result; violates strict aliasing. It's only safe on MSVC. Use a union or memcpy for type punning in C99 / C11.
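For instance, a minimal sketch of the union form of type punning mentioned above (the wrapper name is made up; the next answer uses the memcpy form):
#include <stdint.h>

static inline float bits_to_float(uint32_t u)
{
    union { uint32_t u; float f; } pun;   /* type punning through a union, per C99/C11 */
    pun.u = u;
    return pun.f;
}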
Emulating the multiplication of two IEEE-754 (2008) binary32 operands is a bit more complex than the question suggests. In general, we have to distinguish the following operand classes: zeros, subnormals (0 < |x| < 2^-126), normals (2^-126 ≤ |x| < 2^128), infinities, NaNs. Normals use biased exponents in [1, 254], while any of the special operand classes use biased exponents in {0, 255}. The following assumes we want to implement floating-point multiply with all floating-point exceptions masked, and using the round-to-nearest-or-even rounding mode.
First, we check whether any of the arguments belongs to a special operand class. If so, we check the special cases in sequence. If one of the arguments is a NaN, we turn that NaN into a QNaN and return it. If one of the operands is zero, we return an appropriately signed zero, unless the other argument is an infinity, in which case we return a special QNaN INDEFINITE since this is an invalid operation. After that we check for any argument of infinity, returning an appropriately signed infinity. This leaves subnormals, which we normalize. In case there are two subnormal arguments, we only need to normalize one of them as the result will underflow to zero.
The multiplication of normals proceeds as the asker envisioned in the question. The sign of the result is the exclusive-OR of the signs of the arguments, the exponent of the result is the sum of the exponents of the arguments (adjusted for exponent bias), and the significand of the result is generated from the product of the significands of the arguments. We need the full product for rounding. We can either use a 64-bit type for that, or represent it with a pair of 32-bit numbers. In the code below I have chosen the latter representation. Rounding to nearest-or-even is straightforward: if we have a tie case (the result is exactly in the middle between the closest two binary32 numbers), we need to round up if the least significant bit of the mantissa is 1. Otherwise, we need to round up if the most significant discarded bit (the round bit) is 1.
Three cases need to be considered for the result, based on the result exponent prior to rounding: Exponent is in normal range, result overflows (too large in magnitude), or it underflows (too small in magnitude). In the first case, the result is a normal or infinity if overflow occurs during rounding. In the second case, the result is infinity. In the last case the result is either zero (severe underflow), a subnormal, or the smallest normal (if round-up occurs).
The following code, with a simple framework for light testing via gobs of random test cases and several thousand interesting patterns shows an exemplary ISO-C implementation written in a couple of hours for reasonable clarity and reasonable performance. I let the test framework run for an hour or so on an x64 platform and no errors were reported. If you plan to use the code in production, you would want to construct a more stringent test framework, and may need additional performance tuning.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>
#define FLOAT_MANT_BITS (23)
#define FLOAT_EXPO_BITS (8)
#define FLOAT_EXPO_BIAS (127)
#define FLOAT_MANT_MASK (~((~0u) << (FLOAT_MANT_BITS+1))) /* incl. integer bit */
#define EXPO_ADJUST (1) /* adjustment for performance reasons */
#define MIN_NORM_EXPO (1) /* minimum biased exponent of normals */
#define MAX_NORM_EXPO (254) /* maximum biased exponent of normals */
#define INF_EXPO (255) /* biased exponent of infinities */
#define EXPO_MASK (~((~0u) << FLOAT_EXPO_BITS))
#define FLOAT_SIGN_MASK (0x80000000u)
#define FLOAT_IMPLICIT_BIT (1 << FLOAT_MANT_BITS)
#define RND_BIT_SHIFT (31)
#define RND_BIT_MASK (1u << RND_BIT_SHIFT)
#define FLOAT_INFINITY (0x7f800000)
#define FLOAT_INDEFINITE (0xffc00000u)
#define MANT_LSB (0x00000001)
#define FLOAT_QNAN_BIT (0x00400000)
#define MAX_SHIFT (FLOAT_MANT_BITS + 2)
uint32_t fp32_mul_core (uint32_t a, uint32_t b)
{
uint64_t prod;
uint32_t expoa, expob, manta, mantb, shift;
uint32_t r, signr, expor, mantr_hi, mantr_lo;
/* split arguments into sign, exponent, significand */
expoa = ((a >> FLOAT_MANT_BITS) & EXPO_MASK) - EXPO_ADJUST;
expob = ((b >> FLOAT_MANT_BITS) & EXPO_MASK) - EXPO_ADJUST;
manta = (a | FLOAT_IMPLICIT_BIT) & FLOAT_MANT_MASK;
mantb = (b | FLOAT_IMPLICIT_BIT) & FLOAT_MANT_MASK;
/* result sign bit: XOR sign argument signs */
signr = (a ^ b) & FLOAT_SIGN_MASK;
if ((expoa >= (MAX_NORM_EXPO - EXPO_ADJUST)) || /* at least one argument is special */
(expob >= (MAX_NORM_EXPO - EXPO_ADJUST))) {
if ((a & ~FLOAT_SIGN_MASK) > FLOAT_INFINITY) { /* a is NaN */
/* return quietened NaN */
return a | FLOAT_QNAN_BIT;
}
if ((b & ~FLOAT_SIGN_MASK) > FLOAT_INFINITY) { /* b is NaN */
/* return quietened NaN */
return b | FLOAT_QNAN_BIT;
}
if ((a & ~FLOAT_SIGN_MASK) == 0) { /* a is zero */
/* return NaN if b is infinity, else zero */
return (expob != (INF_EXPO - EXPO_ADJUST)) ? signr : FLOAT_INDEFINITE;
}
if ((b & ~FLOAT_SIGN_MASK) == 0) { /* b is zero */
/* return NaN if a is infinity, else zero */
return (expoa != (INF_EXPO - EXPO_ADJUST)) ? signr : FLOAT_INDEFINITE;
}
if (((a & ~FLOAT_SIGN_MASK) == FLOAT_INFINITY) || /* a or b infinity */
((b & ~FLOAT_SIGN_MASK) == FLOAT_INFINITY)) {
return signr | FLOAT_INFINITY;
}
if ((int32_t)expoa < (MIN_NORM_EXPO - EXPO_ADJUST)) { /* a is subnormal */
/* normalize significand of a */
manta = a & FLOAT_MANT_MASK;
expoa++;
do {
manta = 2 * manta;
expoa--;
} while (manta < FLOAT_IMPLICIT_BIT);
} else if ((int32_t)expob < (MIN_NORM_EXPO - EXPO_ADJUST)) { /* b is subnormal */
/* normalize significand of b */
mantb = b & FLOAT_MANT_MASK;
expob++;
do {
mantb = 2 * mantb;
expob--;
} while (mantb < FLOAT_IMPLICIT_BIT);
}
}
/* result exponent: add argument exponents and adjust for biasing */
expor = expoa + expob - FLOAT_EXPO_BIAS + 2 * EXPO_ADJUST;
mantb = mantb << FLOAT_EXPO_BITS; /* preshift to align result significand */
/* result significand: multiply argument significands */
prod = (uint64_t)manta * mantb;
mantr_hi = (uint32_t)(prod >> 32);
mantr_lo = (uint32_t)(prod >> 0);
/* normalize significand */
if (mantr_hi < FLOAT_IMPLICIT_BIT) {
mantr_hi = (mantr_hi << 1) | (mantr_lo >> (32 - 1));
mantr_lo = (mantr_lo << 1);
expor--;
}
if (expor <= (MAX_NORM_EXPO - EXPO_ADJUST)) { /* normal, may overflow to infinity during rounding */
/* combine biased exponent, sign and significand */
r = (expor << FLOAT_MANT_BITS) + signr + mantr_hi;
/* round result to nearest or even; overflow to infinity possible */
r = r + ((mantr_lo == RND_BIT_MASK) ? (mantr_hi & MANT_LSB) : (mantr_lo >> RND_BIT_SHIFT));
} else if ((int32_t)expor > (MAX_NORM_EXPO - EXPO_ADJUST)) { /* overflow */
/* return infinity */
r = signr | FLOAT_INFINITY;
} else { /* underflow */
/* return zero, normal, or smallest subnormal */
shift = 0 - expor;
if (shift > MAX_SHIFT) shift = MAX_SHIFT;
/* denormalize significand */
mantr_lo = mantr_hi << (32 - shift) | (mantr_lo ? 1 : 0);
mantr_hi = mantr_hi >> shift;
/* combine sign and significand; biased exponent known to be zero */
r = mantr_hi + signr;
/* round result to nearest or even */
r = r + ((mantr_lo == RND_BIT_MASK) ? (mantr_hi & MANT_LSB) : (mantr_lo >> RND_BIT_SHIFT));
}
return r;
}
uint32_t float_as_uint (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float uint_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
float fp32_mul (float a, float b)
{
return uint_as_float (fp32_mul_core (float_as_uint (a), float_as_uint (b)));
}
/* Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007 */
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
#define ISNAN(x) ((float_as_uint (x) << 1) > 0xff000000)
#define QNAN(x) (x | FLOAT_QNAN_BIT)
#define PURELY_RANDOM (0)
#define PATTERN_BASED (1)
#define TEST_MODE (PURELY_RANDOM)
uint32_t v[8192];
int main (void)
{
unsigned long long count = 0;
float a, b, res, ref;
uint32_t i, j, patterns, idx = 0, nbrBits = sizeof (uint32_t) * CHAR_BIT;
/* pattern class 1: 2**i */
for (i = 0; i < nbrBits; i++) {
v [idx] = ((uint32_t)1 << i);
idx++;
}
/* pattern class 2: 2**i-1 */
for (i = 0; i < nbrBits; i++) {
v [idx] = (((uint32_t)1 << i) - 1);
idx++;
}
/* pattern class 3: 2**i+1 */
for (i = 0; i < nbrBits; i++) {
v [idx] = (((uint32_t)1 << i) + 1);
idx++;
}
/* pattern class 4: 2**i + 2**j */
for (i = 0; i < nbrBits; i++) {
for (j = 0; j < nbrBits; j++) {
v [idx] = (((uint32_t)1 << i) + ((uint32_t)1 << j));
idx++;
}
}
/* pattern class 5: 2**i - 2**j */
for (i = 0; i < nbrBits; i++) {
for (j = 0; j < nbrBits; j++) {
v [idx] = (((uint32_t)1 << i) - ((uint32_t)1 << j));
idx++;
}
}
/* pattern class 6: MAX_UINT/(2**i+1) rep. blocks of i zeros and i ones */
for (i = 0; i < nbrBits; i++) {
v [idx] = ((~(uint32_t)0) / (((uint32_t)1 << i) + 1));
idx++;
}
patterns = idx;
/* pattern class 6: one's complement of pattern classes 1 through 5 */
for (i = 0; i < patterns; i++) {
v [idx] = ~v [i];
idx++;
}
/* pattern class 7: two's complement of pattern classes 1 through 5 */
for (i = 0; i < patterns; i++) {
v [idx] = ~v [i] + 1;
idx++;
}
patterns = idx;
#if TEST_MODE == PURELY_RANDOM
printf ("using purely random test vectors\n");
#elif TEST_MODE == PATTERN_BASED
printf ("using pattern-based test vectors\n");
printf ("#patterns = %u\n", patterns);
#endif // TEST_MODE
do {
#if TEST_MODE == PURELY_RANDOM
a = uint_as_float (KISS);
b = uint_as_float (KISS);
#elif TEST_MODE == PATTERN_BASED
i = KISS % patterns;
j = KISS % patterns;
a = uint_as_float ((v[i] & 0x7fffff) | (KISS & ~0x7fffff));
b = uint_as_float ((v[j] & 0x7fffff) | (KISS & ~0x7fffff));
#endif // TEST_MODE
res = fp32_mul (a, b);
ref = a * b;
/* check for bit pattern mismatch between result and reference */
if (float_as_uint (res) != float_as_uint (ref)) {
/* if both a and b are NaNs, either could be returned quietened */
if (! (ISNAN (a) && ISNAN (b) &&
((QNAN (float_as_uint (a)) == float_as_uint (res)) ||
(QNAN (float_as_uint (b)) == float_as_uint (res))))) {
printf ("err: a=% 15.8e (%08x) b=% 15.8e (%08x) res=% 15.8e (%08x) ref=%15.8e (%08x)\n",
a, float_as_uint(a), b, float_as_uint (b), res, float_as_uint (res), ref, float_as_uint (ref));
return EXIT_FAILURE;
}
}
count++;
if (!(count & 0xffffff)) printf ("\r%llu", count);
} while (1);
return EXIT_SUCCESS;
}
It is much more complicated. Take a look at the source of the softfloat library (for example https://github.com/riscv/riscv-pk/blob/master/softfloat/f64_mul.c). Clone it and analyze it.

Even parity of a unsigned int [duplicate]

This question already has answers here:
Count the number of set bits in a 32-bit integer
(65 answers)
Closed 9 years ago.
/*A value has even parity if it has an even number of 1 bits.
*A value has an odd parity if it has an odd number of 1 bits.
*For example, 0110 has even parity, and 1110 has odd parity.
*Return 1 iff x has even parity.
*/
int has_even_parity(unsigned int x) {
}
I'm not sure where to begin writing this function, I'm thinking that I loop through the value as an array and apply xor operations on them.
Would something like the following work? If not, what is the way to approach this?
int has_even_parity(unsigned int x) {
int i, result = x[0];
for (i = 0; i < 3; i++){
result = result ^ x[i + 1];
}
if (result == 0){
return 1;
}
else{
return 0;
}
}
Option #1 - iterate the bits in the "obvious" way, at O(number of bits):
int has_even_parity(unsigned int x)
{
    int p = 1;
    while (x)
    {
        p ^= x & 1;
        x >>= 1; // at each iteration, we shift the input one bit to the right
    }
    return p;
}
Option #2 - iterate only the bits that are set to 1, at O(number of 1s):
int has_even_parity(unsigned int x)
{
    int p = 1;
    while (x)
    {
        p ^= 1;
        x &= x - 1; // at each iteration, we set the least significant 1 to 0
    }
    return p;
}
Option #3 - use the SWAR algorithm for counting 1s, at O(log(number of bits)):
http://aggregate.org/MAGIC/#Population%20Count%20%28Ones%20Count%29
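A sketch of option #3, assuming 32-bit unsigned int: a SWAR population count (the technique from the linked page), then a test of the count's low bit:
int has_even_parity(unsigned int x)
{
    x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit counts      */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit counts      */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit counts      */
    x = (x * 0x01010101u) >> 24;                       /* total count of 1s */
    return !(x & 1u);                                  /* 1 if that count is even */
}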
You can't access an integer as an array,
unsigned x = ...;
// x[0]; doesn't work
But you can use bitwise operations.
unsigned x = ...;
int n = ...;
int bit = (x >> n) & 1u; // Extract bit n, where bit 0 is the LSB
There is a clever way to do this, assuming 32-bit integers:
unsigned parity(unsigned x)
{
x ^= x >> 16;
x ^= x >> 8;
x ^= x >> 4;
x ^= x >> 2;
x ^= x >> 1;
return x & 1;
}

Bitwise operations equivalent of greater than operator

I am working on a function that will essentially see which of two ints is larger. The parameters that are passed are two 32-bit ints. The trick is that the only operators allowed are ! ~ | & << >> ^ (no casting, no data types other than signed int, no *, /, -, etc.).
My idea so far is to ^ the two binaries together to see all the positions of the 1 values that they don't share. What I want to do is then take that value and isolate the 1 farthest to the left. Then see of which of them has that value in it. That value then will be the larger.
(Say we use 8-bit ints instead of 32-bit).
If the two values passed were 01011011 and 01101001
I used ^ on them to get 00100010.
I then want to make it 00100000, in other words 001xxxxx -> 00100000
Then & it with the first number
!! the result and return it.
If it is 1, then the first # is larger.
Any thoughts on how to turn 001xxxxx into 00100000, or anything else to help?
Forgot to note: no ifs, whiles, fors etc...
Here's a loop-free version which compares unsigned integers in O(lg b) operations, where b is the word size of the machine. Note the OP states no other data types besides signed int, so it seems likely the top part of this answer does not meet the OP's specifications. (Spoiler version at the bottom.)
Note that the behavior we want to capture is when the most significant bit mismatch is 1 for a and 0 for b. Another way of thinking about this is any bit in a being larger than the corresponding bit in b means a is greater than b, so long as there wasn't an earlier bit in a that was less than the corresponding bit in b.
To that end, we compute all the bits in a greater than the corresponding bits in b, and likewise compute all the bits in a less than the corresponding bits in b. We now want to mask out all the 'greater than' bits that are below any 'less than' bits, so we take all the 'less than' bits and smear them all to the right making a mask: the most significant bit set all the way down to the least significant bit are now 1.
Now all we have to do is remove the 'greater than' bits set by using simple bit masking logic.
The resulting value is 0 if a <= b and nonzero if a > b. If we want it to be 1 in the latter case we can do a similar smearing trick and just take a look at the least significant bit.
#include <stdio.h>
// Works for unsigned ints.
// Scroll down to the "actual algorithm" to see the interesting code.
// Utility function for displaying binary representation of an unsigned integer
void printBin(unsigned int x) {
for (int i = 31; i >= 0; i--) printf("%i", (x >> i) & 1);
printf("\n");
}
// Utility function to print out a separator
void printSep() {
for (int i = 31; i>= 0; i--) printf("-");
printf("\n");
}
int main()
{
while (1)
{
unsigned int a, b;
printf("Enter two unsigned integers separated by spaces: ");
scanf("%u %u", &a, &b);
getchar();
printBin(a);
printBin(b);
printSep();
/************ The actual algorithm starts here ************/
// These are all the bits in a that are less than their corresponding bits in b.
unsigned int ltb = ~a & b;
// These are all the bits in a that are greater than their corresponding bits in b.
unsigned int gtb = a & ~b;
ltb |= ltb >> 1;
ltb |= ltb >> 2;
ltb |= ltb >> 4;
ltb |= ltb >> 8;
ltb |= ltb >> 16;
// Nonzero if a > b
// Zero if a <= b
unsigned int isGt = gtb & ~ltb;
// If you want to make this exactly '1' when nonzero do this part:
isGt |= isGt >> 1;
isGt |= isGt >> 2;
isGt |= isGt >> 4;
isGt |= isGt >> 8;
isGt |= isGt >> 16;
isGt &= 1;
/************ The actual algorithm ends here ************/
// Print out the results.
printBin(ltb); // Debug info
printBin(gtb); // Debug info
printSep();
printBin(isGt); // The actual result
}
}
Note: This should work for signed integers as well if you flip the top bit on both of the inputs, e.g. a ^= 0x80000000.
Spoiler
If you want an answer that meets all of the requirements (including 25 operators or less):
int isGt(int a, int b)
{
int diff = a ^ b;
diff |= diff >> 1;
diff |= diff >> 2;
diff |= diff >> 4;
diff |= diff >> 8;
diff |= diff >> 16;
diff &= ~(diff >> 1) | 0x80000000;
diff &= (a ^ 0x80000000) & (b ^ 0x7fffffff);
return !!diff;
}
I'll leave explaining why it works up to you.
To convert 001xxxxx to 00100000, you first execute:
x |= x >> 4;
x |= x >> 2;
x |= x >> 1;
(this is for 8 bits; to extend it to 32, add shifts by 8 and 16 at the start of the sequence).
This leaves us with 00111111 (this technique is sometimes called "bit-smearing"). We can then chop off all but the first 1 bit:
x ^= x >> 1;
leaving us with 00100000.
An unsigned variant, given that one can use logical (&&, ||) and comparison (!=, ==) operators:
int u_isgt(unsigned int a, unsigned int b)
{
return a != b && ( /* If a == b then a !> b and a !< b. */
b == 0 || /* Else if b == 0 a has to be > b (as a != 0). */
(a / b) /* Else divide; integer division always truncate */
); /* towards zero. Giving 0 if a < b. */
}
!= and == can easily be eliminated, e.g.:
int u_isgt(unsigned int a, unsigned int b)
{
return a ^ b && (
!(b ^ 0) ||
(a / b)
);
}
For signed one could then expand to something like:
int isgt(int a, int b)
{
return
(a != b) &&
(
(!(0x80000000 & a) && 0x80000000 & b) || /* if a >= 0 && b < 0 */
(!(0x80000000 & a) && b == 0) ||
/* Two more lines, can add them if you like, but as it is homework
* I'll leave it up to you to decide.
* Hint: check on "both negative" and "both not negative". */
)
;
}
This can be made more compact / eliminate ops (at least one), but I put it like this for clarity.
Instead of 0x80000000 one could say, e.g.:
#include <limits.h>
static const int INT_NEG = (1 << ((sizeof(int) * CHAR_BIT) - 1));
Using this to test:
void test_isgt(int a, int b)
{
fprintf(stdout,
"%11d > %11d = %d : %d %s\n",
a, b,
isgt(a, b), (a > b),
isgt(a, b) != (a>b) ? "BAD!" : "OK!");
}
Result:
33 > 0 = 1 : 1 OK!
-33 > 0 = 0 : 0 OK!
0 > 33 = 0 : 0 OK!
0 > -33 = 1 : 1 OK!
0 > 0 = 0 : 0 OK!
33 > 33 = 0 : 0 OK!
-33 > -33 = 0 : 0 OK!
-5 > -33 = 1 : 1 OK!
-33 > -5 = 0 : 0 OK!
-2147483647 > 2147483647 = 0 : 0 OK!
2147483647 > -2147483647 = 1 : 1 OK!
2147483647 > 2147483647 = 0 : 0 OK!
2147483647 > 0 = 1 : 1 OK!
0 > 2147483647 = 0 : 0 OK!
A fully branchless version of Kaganar's smaller isGt function might look like so:
int isGt(int a, int b)
{
int diff = a ^ b;
diff |= diff >> 1;
diff |= diff >> 2;
diff |= diff >> 4;
diff |= diff >> 8;
diff |= diff >> 16;
//1+ on GT, 0 otherwise.
diff &= ~(diff >> 1) | 0x80000000;
diff &= (a ^ 0x80000000) & (b ^ 0x7fffffff);
//flatten back to range of 0 or 1.
diff |= diff >> 1;
diff |= diff >> 2;
diff |= diff >> 4;
diff |= diff >> 8;
diff |= diff >> 16;
diff &= 1;
return diff;
}
This clocks in at around 60 instructions for the actual computation (MSVC 2010 compiler, on an x86 arch), plus an extra 10 stack ops or so for the function's prolog/epilog.
EDIT:
Okay, there were some issues with the code, but I revised it and the following works.
This auxiliary function compares the numbers' n'th significant digit:
int compare ( int a, int b, int n )
{
    int digit = (0x1 << (n-1));
    if ( (a & digit) && (b & digit) )
        return 0; //the digit is the same
    if ( (a & digit) && !(b & digit) )
        return 1; //a is greater than b
    if ( !(a & digit) && (b & digit) )
        return -1; //b is greater than a
    return 0; //neither has the digit set, so they are the same here
}
The following should then return the larger number:
int larger ( int a, int b )
{
for ( int i = 8*sizeof(a) ; i >= 1 ; i-- )
{
if ( int k = compare ( a, b, i ) )
{
return (k == 1) ? a : b;
}
}
return 0; //equal
}
As much as I don't want to do someone else's homework I couldn't resist this one.. :) I am sure others can think of a more compact one..but here is mine..works well, including negative numbers..
Edit: there are a couple of bugs though. I will leave it to the OP to find them and fix them.
#include<unistd.h>
#include<stdio.h>
int a, b, i, ma, mb, a_neg, b_neg, stop;
int flipnum(int *num, int *is_neg) {
*num = ~(*num) + 1;
*is_neg = 1;
return 0;
}
int print_num1() {
return ((a_neg && printf("bigger number %d\n", mb)) ||
printf("bigger number %d\n", ma));
}
int print_num2() {
return ((b_neg && printf("bigger number %d\n", ma)) ||
printf("bigger number %d\n", mb));
}
int check_num1(int j) {
return ((a & j) && print_num1());
}
int check_num2(int j) {
return ((b & j) && print_num2());
}
int recursive_check (int j) {
((a & j) ^ (b & j)) && (check_num1(j) || check_num2(j)) && (stop = 1, j = 0);
return(!stop && (j = j >> 1) && recursive_check(j));
}
int main() {
int j;
scanf("%d%d", &a, &b);
ma = a; mb = b;
i = (sizeof (int) * 8) - 1;
j = 1 << i;
((a & j) && flipnum(&a, &a_neg));
((b & j) && flipnum(&b, &b_neg));
j = 1 << (i - 1);
recursive_check(j);
(!stop && printf("numbers are same..\n"));
}
I think I have a solution with 3 operations:
Add one to the first number, then subtract it from the largest possible number you can represent (all 1's). Add that number to the second number. If it overflows, then the first number is less than the second.
I'm not 100% sure if this is correct. That is, you might not need to add 1, and I don't know if it's possible to check for overflow (if not, then just reserve the last bit and test if it's 1 at the end.)
EDIT: The constraints make the simple approach at the bottom invalid. I am adding the binary search function and the final comparison to detect the greater value:
unsigned long greater(unsigned long a, unsigned long b) {
unsigned long x = a;
unsigned long y = b;
unsigned long t = a ^ b;
if (t & 0xFFFF0000) {
x >>= 16;
y >>= 16;
t >>= 16;
}
if (t & 0xFF00) {
x >>= 8;
y >>= 8;
t >>= 8;
}
if (t & 0xf0) {
x >>= 4;
y >>= 4;
t >>= 4;
}
if ( t & 0xc) {
x >>= 2;
y >>= 2;
t >>= 2;
}
if ( t & 0x2) {
x >>= 1;
y >>= 1;
t >>= 1;
}
return (x & 1) ? a : b;
}
The idea is to start off with the most significant half of the word we are interested in and see if there are any set bits in there. If there are, then we don't need the least significant half, so we shift the unwanted bits away. If not, we do nothing (the half is zero anyway, so it won't get in the way). Since we cannot keep track of the shifted amount (it would require addition), we also shift the original values so that we can do the final and to determine the larger number. We repeat this process with half the size of the previous mask until we collapse the interesting bits into bit position 0.
I didn't add the equal case in here on purpose.
Old answer:
The simplest method is probably the best for a homework. Once you've got the mismatching bit value, you start off with another mask at 0x80000000 (or whatever suitable max bit position for your word size), and keep right shifting this until you hit a bit that is set in your mismatch value. If your right shift ends up with 0, then the mismatch value is 0.
I assume you already know the final step required to determine the larger number.
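For illustration only, a minimal sketch of that simple method (it uses a loop and comparisons, so as noted it does not satisfy the question's operator restrictions; the function name is made up):
int is_greater_simple(unsigned int a, unsigned int b)
{
    unsigned int mismatch = a ^ b;      /* bits where a and b differ          */
    unsigned int mask = 0x80000000u;    /* start at the most significant bit  */
    while (mask && !(mask & mismatch))  /* shift until we hit a differing bit (or run out) */
        mask >>= 1;
    return (a & mask) != 0;             /* a is greater if it owns that bit   */
}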

Rounding up to next power of 2

I want to write a function that returns the nearest next power of 2 number. For example if my input is 789, the output should be 1024. Is there any way of achieving this without using any loops but just using some bitwise operators?
Check the Bit Twiddling Hacks. You need to get the base 2 logarithm, then add 1 to that. Example for a 32-bit value:
Round up to the next highest power of 2
unsigned int v; // compute the next highest power of 2 of 32-bit v
v--;
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v++;
The extension to other widths should be obvious.
next = pow(2, ceil(log(x)/log(2)));
This works by finding the number you'd have to raise 2 by to get x (take the log of the number, and divide by the log of the desired base; see wikipedia for more). Then round that up with ceil to get the nearest whole number power.
This is a more general purpose (i.e. slower!) method than the bitwise methods linked elsewhere, but good to know the maths, eh?
I think this works, too:
int power = 1;
while(power < x)
power*=2;
And the answer is power.
unsigned long upper_power_of_two(unsigned long v)
{
v--;
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v++;
return v;
}
If you're using GCC, you might want to have a look at Optimizing the next_pow2() function by Lockless Inc. This page describes a way to use the built-in function __builtin_clz() (count leading zeros) and then directly use the x86 (ia32) assembler instruction bsr (bit scan reverse), just as described in another answer's link to a gamedev site. This code might be faster than those described in previous answers.
By the way, if you're not going to use assembler instruction and 64bit data type, you can use this
/**
* return the smallest power of two value
* greater than x
*
* Input range: [2..2147483648]
* Output range: [2..2147483648]
*
*/
__attribute__ ((const))
static inline uint32_t p2(uint32_t x)
{
#if 0
assert(x > 1);
assert(x <= ((UINT32_MAX/2) + 1));
#endif
return 1 << (32 - __builtin_clz (x - 1));
}
One more option: although it uses a loop, this is much faster than the math-based approaches.
power of two "floor" option:
int power = 1;
while (x >>= 1) power <<= 1;
power of two "ceil" option:
int power = 2;
x--; // <<-- UPDATED
while (x >>= 1) power <<= 1;
UPDATE
As mentioned in the comments, there was a mistake in the ceil version where its result was wrong.
Here are full functions:
unsigned power_floor(unsigned x) {
int power = 1;
while (x >>= 1) power <<= 1;
return power;
}
unsigned power_ceil(unsigned x) {
if (x <= 1) return 1;
int power = 2;
x--;
while (x >>= 1) power <<= 1;
return power;
}
In standard c++20 this is included in <bit>.
The answer is simply
#include <bit>
unsigned long upper_power_of_two(unsigned long v)
{
return std::bit_ceil(v);
}
NOTE:
The solution I gave is for C++, not C. I would have given an answer to this question instead, but it was closed as a duplicate of this one!
For any unsigned type, building on the Bit Twiddling Hacks:
#include <climits>
#include <type_traits>
template <typename UnsignedType>
UnsignedType round_up_to_power_of_2(UnsignedType v) {
static_assert(std::is_unsigned<UnsignedType>::value, "Only works for unsigned types");
v--;
for (size_t i = 1; i < sizeof(v) * CHAR_BIT; i *= 2) //Prefer size_t "Warning comparison between signed and unsigned integer"
{
v |= v >> i;
}
return ++v;
}
There isn't really a loop there as the compiler knows at compile time the number of iterations.
Although the question is tagged c, here are my five cents. Lucky for us, C++20 includes std::ceil2 and std::floor2 (see here). They are constexpr template functions; the current GCC implementation uses bit shifting and works with any unsigned integral type.
For IEEE floats you'd be able to do something like this.
int next_power_of_two(float a_F){
int f = *(int*)&a_F;
int b = f << 9 != 0; // If we're a power of two this is 0, otherwise this is 1
f >>= 23; // remove fractional part of floating point number
f -= 127; // subtract 127 (the bias) from the exponent
// adds one to the exponent if were not a power of two,
// then raises our new exponent to the power of two again.
return (1 << (f + b));
}
If you need an integer solution and you're able to use inline assembly, BSR will give you the log2 of an integer on x86. It returns the bit index of the highest set bit, which is exactly the integer log2 of that number. Other processors often have similar instructions, such as CLZ, and depending on your compiler there might be an intrinsic available to do the work for you.
Here's my solution in C. Hope this helps!
int next_power_of_two(int n) {
int i = 0;
for (--n; n > 0; n >>= 1) {
i++;
}
return 1 << i;
}
In x86 you can use the sse4 bit manipulation instructions to make it fast.
//assume input is in eax
mov ecx,31
popcnt edx,eax //cycle 1
lzcnt eax,eax //cycle 2
sub ecx,eax
mov eax,1
cmp edx,1 //cycle 3
jle #done //cycle 4 - popcnt says its a power of 2, return input unchanged
shl eax,cl //cycle 5
#done: rep ret //cycle 5
In c you can use the matching intrinsics.
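For example, a sketch using the GCC/Clang spellings of those intrinsics (the function name and build flags are my assumptions, and it expresses the intent directly rather than transcribing the assembly one-to-one):
#include <immintrin.h>   /* _lzcnt_u32, _mm_popcnt_u32; build with e.g. -mlzcnt -mpopcnt */

unsigned int next_pow2(unsigned int x)
{
    if (_mm_popcnt_u32(x) <= 1)            /* zero or already a power of two */
        return x ? x : 1;                  /* return the input unchanged (1 for 0) */
    return 1u << (32 - _lzcnt_u32(x));     /* one bit above the current MSB */
}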
Or jumpless, which speeds up things by avoiding a misprediction due to a jump, but slows things down by lengthening the dependency chain. Time the code to see which works best for you.
//assume input is in eax
mov ecx,31
popcnt edx,eax //cycle 1
lzcnt eax,eax
sub ecx,eax
mov eax,1 //cycle 2
cmp edx,1
mov edx,0 //cycle 3
cmovle ecx,edx //cycle 4 - ensure eax does not change
shl eax,cl
#done: rep ret //cycle 5
/*
** http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
*/
#define __LOG2A(s) ((s &0xffffffff00000000) ? (32 +__LOG2B(s >>32)): (__LOG2B(s)))
#define __LOG2B(s) ((s &0xffff0000) ? (16 +__LOG2C(s >>16)): (__LOG2C(s)))
#define __LOG2C(s) ((s &0xff00) ? (8 +__LOG2D(s >>8)) : (__LOG2D(s)))
#define __LOG2D(s) ((s &0xf0) ? (4 +__LOG2E(s >>4)) : (__LOG2E(s)))
#define __LOG2E(s) ((s &0xc) ? (2 +__LOG2F(s >>2)) : (__LOG2F(s)))
#define __LOG2F(s) ((s &0x2) ? (1) : (0))
#define LOG2_UINT64 __LOG2A
#define LOG2_UINT32 __LOG2B
#define LOG2_UINT16 __LOG2C
#define LOG2_UINT8 __LOG2D
static inline uint64_t
next_power_of_2(uint64_t i)
{
#if defined(__GNUC__)
return 1UL <<(1 +(63 -__builtin_clzl(i -1)));
#else
i =i -1;
i =LOG2_UINT64(i);
return 1UL <<(1 +i);
#endif
}
If you do not want to venture into the realm of undefined behaviour, the input value must be between 1 and 2^63. The macro is also useful for setting a constant at compile time.
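For example, a sketch of that compile-time use, building on the macros above (the names REQUESTED, POOL_SIZE and pool are made up):
/* Size a buffer to the next power of two above a configured value,
   folded entirely by the preprocessor/compiler: */
#define REQUESTED  1000u
#define POOL_SIZE  (1UL << (1 + LOG2_UINT32(REQUESTED - 1)))   /* 1024 */
static unsigned char pool[POOL_SIZE];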
For completeness here is a floating-point implementation in bog-standard C.
double next_power_of_two(double value) {
int exp;
if(frexp(value, &exp) == 0.5) {
// Omit this case to round precise powers of two up to the *next* power
return value;
}
return ldexp(1.0, exp);
}
An efficient Microsoft (e.g., Visual Studio 2017) specific solution in C / C++ for integer input. Handles the case of the input exactly matching a power of two value by decrementing before checking the location of the most significant 1 bit.
inline unsigned int ExpandToPowerOf2(unsigned int Value)
{
unsigned long Index;
_BitScanReverse(&Index, Value - 1);
return (1U << (Index + 1));
}
// - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#if defined(WIN64) // The _BitScanReverse64 intrinsic is only available for 64 bit builds because it depends on x64
inline unsigned long long ExpandToPowerOf2(unsigned long long Value)
{
unsigned long Index;
_BitScanReverse64(&Index, Value - 1);
return (1ULL << (Index + 1));
}
#endif
This generates 5 or so inlined instructions for an Intel processor similar to the following:
dec eax
bsr rcx, rax
inc ecx
mov eax, 1
shl rax, cl
Apparently the Visual Studio C++ compiler isn't coded to optimize this for compile-time values, but it's not like there are a whole lot of instructions there.
Edit:
If you want an input value of 1 to yield 1 (2 to the zeroth power), a small modification to the above code still generates straight through instructions with no branch.
inline unsigned int ExpandToPowerOf2(unsigned int Value)
{
unsigned long Index;
_BitScanReverse(&Index, --Value);
if (Value == 0)
Index = (unsigned long) -1;
return (1U << (Index + 1));
}
Generates just a few more instructions. The trick is that Index can be replaced by a test followed by a cmove instruction.
Trying to make an "ultimate" solution for this. The following code
is targeted for C language (not C++),
uses compiler built-ins to yield efficient code (CLZ or BSR instruction) if compiler supports any,
is portable (standard C and no assembly) with the exception of built-ins, and
addresses all undefined behaviors.
If you're writing in C++, you may adjust the code appropriately. Note that C++20 introduces std::bit_ceil which does the exact same thing except the behavior may be undefined on certain conditions.
#include <limits.h>
#ifdef _MSC_VER
# if _MSC_VER >= 1400
/* _BitScanReverse is introduced in Visual C++ 2005 and requires
<intrin.h> (also introduced in Visual C++ 2005). */
#include <intrin.h>
#pragma intrinsic(_BitScanReverse)
#pragma intrinsic(_BitScanReverse64)
# define HAVE_BITSCANREVERSE 1
# endif
#endif
/* Macro indicating that the compiler supports __builtin_clz().
The name HAVE_BUILTIN_CLZ seems to be the most common, but in some
projects HAVE__BUILTIN_CLZ is used instead. */
#ifdef __has_builtin
# if __has_builtin(__builtin_clz)
# define HAVE_BUILTIN_CLZ 1
# endif
#elif defined(__GNUC__)
# if (__GNUC__ > 3)
# define HAVE_BUILTIN_CLZ 1
# elif defined(__GNUC_MINOR__)
# if (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)
# define HAVE_BUILTIN_CLZ 1
# endif
# endif
#endif
/**
* Returns the smallest power of two that is not smaller than x.
*/
unsigned long int next_power_of_2_long(unsigned long int x)
{
if (x <= 1) {
return 1;
}
x--;
#ifdef HAVE_BITSCANREVERSE
if (x > (ULONG_MAX >> 1)) {
return 0;
} else {
unsigned long int index;
(void) _BitScanReverse(&index, x);
return (1UL << (index + 1));
}
#elif defined(HAVE_BUILTIN_CLZ)
if (x > (ULONG_MAX >> 1)) {
return 0;
}
return (1UL << (sizeof(x) * CHAR_BIT - __builtin_clzl(x)));
#else
/* Solution from "Bit Twiddling Hacks"
<http://www.graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2>
but converted to a loop for smaller code size.
("gcc -O3" will unroll this.) */
{
unsigned int shift;
for (shift = 1; shift < sizeof(x) * CHAR_BIT; shift <<= 1) {
x |= (x >> shift);
}
}
return (x + 1);
#endif
}
unsigned int next_power_of_2(unsigned int x)
{
if (x <= 1) {
return 1;
}
x--;
#ifdef HAVE_BITSCANREVERSE
if (x > (UINT_MAX >> 1)) {
return 0;
} else {
unsigned long int index;
(void) _BitScanReverse(&index, x);
return (1U << (index + 1));
}
#elif defined(HAVE_BUILTIN_CLZ)
if (x > (UINT_MAX >> 1)) {
return 0;
}
return (1U << (sizeof(x) * CHAR_BIT - __builtin_clz(x)));
#else
{
unsigned int shift;
for (shift = 1; shift < sizeof(x) * CHAR_BIT; shift <<= 1) {
x |= (x >> shift);
}
}
return (x + 1);
#endif
}
unsigned long long next_power_of_2_long_long(unsigned long long x)
{
if (x <= 1) {
return 1;
}
x--;
#if (defined(HAVE_BITSCANREVERSE) && \
ULLONG_MAX == 18446744073709551615ULL)
if (x > (ULLONG_MAX >> 1)) {
return 0;
} else {
/* assert(sizeof(__int64) == sizeof(long long)); */
unsigned long int index;
(void) _BitScanReverse64(&index, x);
return (1ULL << (index + 1));
}
#elif defined(HAVE_BUILTIN_CLZ)
if (x > (ULLONG_MAX >> 1)) {
return 0;
}
return (1ULL << (sizeof(x) * CHAR_BIT - __builtin_clzll(x)));
#else
{
unsigned int shift;
for (shift = 1; shift < sizeof(x) * CHAR_BIT; shift <<= 1) {
x |= (x >> shift);
}
}
return (x + 1);
#endif
}
Portable solution in C#:
int GetNextPowerOfTwo(int input) {
return 1 << (int)Math.Ceiling(Math.Log2(input));
}
Math.Ceiling(Math.Log2(value)) calculates the exponent of the next power of two, the 1 << calculates the real value through bitshifting.
Faster solution if you have .NET Core 3 or above:
uint GetNextPowerOfTwoFaster(uint input) {
return (uint)1 << (sizeof(uint) * 8 - System.Numerics.BitOperations.LeadingZeroCount(input - 1));
}
This uses System.Numerics.BitOperations.LeadingZeroCount() which uses a hardware instruction if available:
https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/shared/System/Numerics/BitOperations.cs
Update:
RoundUpToPowerOf2() is Coming in .NET 6! The internal implementation is mostly the same as the .NET Core 3 solution above.
Here's the community update.
You might find the following clarification to be helpful towards your purpose:
constexpr version of clp2 for C++14
#include <iostream>
#include <type_traits>
// Closest least power of 2 minus 1. Returns 0 if n = 0.
template <typename UInt, std::enable_if_t<std::is_unsigned<UInt>::value,int> = 0>
constexpr UInt clp2m1(UInt n, unsigned i = 1) noexcept
{ return i < sizeof(UInt) * 8 ? clp2m1(UInt(n | (n >> i)),i << 1) : n; }
/// Closest least power of 2 minus 1. Returns 0 if n <= 0.
template <typename Int, std::enable_if_t<std::is_integral<Int>::value && std::is_signed<Int>::value,int> = 0>
constexpr auto clp2m1(Int n) noexcept
{ return clp2m1(std::make_unsigned_t<Int>(n <= 0 ? 0 : n)); }
/// Closest least power of 2. Returns 2^N: 2^(N-1) < n <= 2^N. Returns 0 if n <= 0.
template <typename Int, std::enable_if_t<std::is_integral<Int>::value,int> = 0>
constexpr auto clp2(Int n) noexcept
{ return clp2m1(std::make_unsigned_t<Int>(n-1)) + 1; }
/// Next power of 2. Returns 2^N: 2^(N-1) <= n < 2^N. Returns 1 if n = 0. Returns 0 if n < 0.
template <typename Int, std::enable_if_t<std::is_integral<Int>::value,int> = 0>
constexpr auto np2(Int n) noexcept
{ return clp2m1(std::make_unsigned_t<Int>(n)) + 1; }
template <typename T>
void test(T v) { std::cout << clp2(v) << std::endl; }
int main()
{
test(-5); // 0
test(0); // 0
test(8); // 8
test(31); // 32
test(33); // 64
test(789); // 1024
test(char(260)); // 4
test(unsigned(-1) - 1); // 0
test<long long>(unsigned(-1) - 1); // 4294967296
return 0;
}
Many processor architectures support log base 2 or very similar operation – count leading zeros. Many compilers have intrinsics for it. See https://en.wikipedia.org/wiki/Find_first_set
Assuming you have a good compiler that can do the bit twiddling beforehand (which is above me at this point), but anyway this works!!!
// http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
#define SH1(v) ((v-1) | ((v-1) >> 1)) // accidentally came up w/ this...
#define SH2(v) ((v) | ((v) >> 2))
#define SH4(v) ((v) | ((v) >> 4))
#define SH8(v) ((v) | ((v) >> 8))
#define SH16(v) ((v) | ((v) >> 16))
#define OP(v) (SH16(SH8(SH4(SH2(SH1(v))))))
#define CB0(v) ((v) - (((v) >> 1) & 0x55555555))
#define CB1(v) (((v) & 0x33333333) + (((v) >> 2) & 0x33333333))
#define CB2(v) ((((v) + ((v) >> 4) & 0xF0F0F0F) * 0x1010101) >> 24)
#define CBSET(v) (CB2(CB1(CB0((v)))))
#define FLOG2(v) (CBSET(OP(v)))
Test code below:
#include <iostream>
using namespace std;
// http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
#define SH1(v) ((v-1) | ((v-1) >> 1)) // accidentally guessed this...
#define SH2(v) ((v) | ((v) >> 2))
#define SH4(v) ((v) | ((v) >> 4))
#define SH8(v) ((v) | ((v) >> 8))
#define SH16(v) ((v) | ((v) >> 16))
#define OP(v) (SH16(SH8(SH4(SH2(SH1(v))))))
#define CB0(v) ((v) - (((v) >> 1) & 0x55555555))
#define CB1(v) (((v) & 0x33333333) + (((v) >> 2) & 0x33333333))
#define CB2(v) ((((v) + ((v) >> 4) & 0xF0F0F0F) * 0x1010101) >> 24)
#define CBSET(v) (CB2(CB1(CB0((v)))))
#define FLOG2(v) (CBSET(OP(v)))
#define SZ4 FLOG2(4)
#define SZ6 FLOG2(6)
#define SZ7 FLOG2(7)
#define SZ8 FLOG2(8)
#define SZ9 FLOG2(9)
#define SZ16 FLOG2(16)
#define SZ17 FLOG2(17)
#define SZ127 FLOG2(127)
#define SZ1023 FLOG2(1023)
#define SZ1024 FLOG2(1024)
#define SZ2_17 FLOG2((1ul << 17)) //
#define SZ_LOG2 FLOG2(SZ)
#define DBG_PRINT(x) do { std::printf("Line:%-4d" " %10s = %-10d\n", __LINE__, #x, x); } while(0);
uint32_t arrTble[FLOG2(63)];
int main(){
int8_t n;
DBG_PRINT(SZ4);
DBG_PRINT(SZ6);
DBG_PRINT(SZ7);
DBG_PRINT(SZ8);
DBG_PRINT(SZ9);
DBG_PRINT(SZ16);
DBG_PRINT(SZ17);
DBG_PRINT(SZ127);
DBG_PRINT(SZ1023);
DBG_PRINT(SZ1024);
DBG_PRINT(SZ2_17);
return(0);
}
Outputs:
Line:39 SZ4 = 2
Line:40 SZ6 = 3
Line:41 SZ7 = 3
Line:42 SZ8 = 3
Line:43 SZ9 = 4
Line:44 SZ16 = 4
Line:45 SZ17 = 5
Line:46 SZ127 = 7
Line:47 SZ1023 = 10
Line:48 SZ1024 = 10
Line:49 SZ2_17 = 17
I was trying to get the nearest lower power of 2 and made this function. May it help you. Just multiply the nearest lower power by 2 to get the nearest upper power of 2:
int nearest_upper_power(int number){
int temp=number;
while((number&(number-1))!=0){
temp<<=1;
number&=temp;
}
//Here number is closest lower power
number*=2;
return number;
}
Adapted Paul Dixon's answer to Excel, this works perfectly.
=POWER(2,CEILING.MATH(LOG(A1)/LOG(2)))
A variant of #YannDroneaud's answer, valid for x==1, only for x86 platforms and the gcc or clang compilers:
__attribute__ ((const))
static inline uint32_t p2(uint32_t x)
{
#if 0
assert(x > 0);
assert(x <= ((UINT32_MAX/2) + 1));
#endif
int clz;
uint32_t xm1 = x-1;
asm(
"lzcnt %1,%0"
:"=r" (clz)
:"rm" (xm1)
:"cc"
);
return 1 << (32 - clz);
}
Here is what I'm using to have this be a constant expression, if the input is a constant expression.
#define uptopow2_0(v) ((v) - 1)
#define uptopow2_1(v) (uptopow2_0(v) | uptopow2_0(v) >> 1)
#define uptopow2_2(v) (uptopow2_1(v) | uptopow2_1(v) >> 2)
#define uptopow2_3(v) (uptopow2_2(v) | uptopow2_2(v) >> 4)
#define uptopow2_4(v) (uptopow2_3(v) | uptopow2_3(v) >> 8)
#define uptopow2_5(v) (uptopow2_4(v) | uptopow2_4(v) >> 16)
#define uptopow2(v) (uptopow2_5(v) + 1) /* this is the one programmer uses */
So for instance, an expression like:
uptopow2(sizeof (struct foo))
will nicely reduce to a constant.
The g++ compiler provides a builtin function __builtin_clz that counts leading zeros:
So we could do:
int nextPowerOfTwo(unsigned int x) {
return 1 << sizeof(x)*8 - __builtin_clz(x);
}
int main () {
std::cout << nextPowerOfTwo(7) << std::endl;
std::cout << nextPowerOfTwo(31) << std::endl;
std::cout << nextPowerOfTwo(33) << std::endl;
std::cout << nextPowerOfTwo(8) << std::endl;
std::cout << nextPowerOfTwo(91) << std::endl;
return 0;
}
Results:
8
32
64
16
128
But note that, for x == 0, __builtin_clz return is undefined.
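If a zero input is possible, a guarded variant might look like this (a sketch; the name is made up, the behaviour for nonzero inputs is unchanged, and like the original it still assumes x fits below 2^31):
unsigned int nextPowerOfTwoSafe(unsigned int x) {
    if (x == 0)
        return 1;                                      // avoid __builtin_clz(0), which is undefined
    return 1u << (sizeof(x) * 8 - __builtin_clz(x));   // same as above for x > 0
}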
If you need it for OpenGL related stuff:
/* Compute the nearest power of 2 number that is
* less than or equal to the value passed in.
*/
static GLuint
nearestPower( GLuint value )
{
int i = 1;
if (value == 0) return -1; /* Error! */
for (;;) {
if (value == 1) return i;
else if (value == 3) return i*4;
value >>= 1; i *= 2;
}
}
Convert it to a float and then use .hex() which shows the normalized IEEE representation.
>>> float(789).hex()
'0x1.8a80000000000p+9'
Then just extract the exponent and add 1.
>>> int(float(789).hex().split('p+')[1]) + 1
10
And raise 2 to this power.
>>> 2 ** (int(float(789).hex().split('p+')[1]) + 1)
1024
from math import ceil, log2
pot_ceil = lambda N: 0x1 << ceil(log2(N))
Test:
for i in range(1, 11):
print(i, pot_ceil(i))
Output:
1 1
2 2
3 4
4 4
5 8
6 8
7 8
8 8
9 16
10 16
import sys
def is_power2(x):
return x > 0 and ((x & (x - 1)) == 0)
def find_nearest_power2(x):
if x <= 0:
raise ValueError("invalid input")
if is_power2(x):
return x
else:
bits = get_bits(x)
upper = 1 << (bits)
lower = 1 << (bits - 1)
mid = (upper + lower) // 2
if (x - mid) > 0:
return upper
else:
return lower
def get_bits(x):
"""return number of bits in binary representation"""
if x < 0:
raise ValueError("invalid input: input should be positive integer")
count = 0
while (x != 0):
try:
x = x >> 1
except TypeError as error:
print(error, "input should be of type integer")
sys.exit(1)
count += 1
return count

What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that is a 1), what is the quickest/most efficient method of finding out?
I know that POSIX supports a ffs() method in <strings.h> to find the first set bit, but there doesn't seem to be a corresponding fls() method.
Is there some really obvious way of doing this that I'm missing?
What about in cases where you can't use POSIX functions for portability?
EDIT: What about a solution that works on both 32- and 64-bit architectures (many of the code listings seem like they'd only work on 32-bit integers).
GCC has:
-- Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in X, starting at the most
significant bit position. If X is 0, the result is undefined.
-- Built-in Function: int __builtin_clzl (unsigned long)
Similar to `__builtin_clz', except the argument type is `unsigned
long'.
-- Built-in Function: int __builtin_clzll (unsigned long long)
Similar to `__builtin_clz', except the argument type is `unsigned
long long'.
I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.
A useful trick if your input can be zero is __builtin_clz(x | 1): unconditionally setting the low bit without modifying any others makes the output 31 for x=0, without changing the output for any other input.
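As a concrete sketch of that __builtin_clz(x | 1) trick (the wrapper name is mine; assumes 32-bit unsigned int):
static inline int msb_index(unsigned int x)
{
    /* OR-ing in bit 0 never changes the most significant set bit of a nonzero x,
       and it turns x == 0 into 1, so the builtin's undefined-for-zero case is
       avoided (the result is then 0). */
    return 31 - __builtin_clz(x | 1);
}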
To avoid needing to do that, your other option is platform-specific intrinsics like ARM GCC's __clz (no header needed), or x86's _lzcnt_u32 on CPUs that support the lzcnt instruction. (Beware that lzcnt decodes as bsr on older CPUs instead of faulting, which gives 31-lzcnt for non-zero inputs.)
There's unfortunately no way to portably take advantage of the various CLZ instructions on non-x86 platforms that do define the result for input=0 as 32 or 64 (according to the operand width). x86's lzcnt does that, too, while bsr produces a bit-index that the compiler has to flip unless you use 31-__builtin_clz(x).
(The "undefined result" is not C Undefined Behavior, just a value that isn't defined. It's actually whatever was in the destination register when the instruction ran. AMD documents this, Intel doesn't, but Intel's CPUs do implement that behaviour. But it's not whatever was previously in the C variable you're assigning to, that's not usually how things work when gcc turns C into asm. See also Why does breaking the "output dependency" of LZCNT matter?)
Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
unsigned int v;
unsigned r = 0;
while (v >>= 1) {
r++;
}
This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.
If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.
NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.
Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:
Searches the source operand for the most significant set
bit (1 bit). If a most significant 1
bit is found, its bit index is stored
in the destination operand. The source operand can be a
register or a memory location; the
destination operand is a register. The
bit index is an unsigned offset from
bit 0 of the source operand. If the
content of the source operand is 0, the
content of the destination operand is
undefined.
(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)
Example code for gcc:
#include <iostream>
int main (int,char**)
{
int n=1;
for (;;++n) {
int msb;
asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
std::cout << n << " : " << msb << std::endl;
}
return 0;
}
See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.
This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.
My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.
double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023; // assumes x86 endianness
This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.
It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.
To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.
// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?
double ff=(double)(v|1);
uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;
Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.
This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).
This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD
This should be lightning fast:
int msb(unsigned int v) {
static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v = (v >> 1) + 1;
return pos[(v * 0x077CB531UL) >> 27];
}
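If what you actually need is the greatest power of two strictly below x (the original goal), you could then do something like this, assuming x >= 2 and using the msb() above (the wrapper name is mine):
unsigned int pow2_below(unsigned int x)
{
    return 1u << msb(x - 1);
}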
Kaz Kylheku here
I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.
(I happen to need this "find highest bit" for something, you see.)
I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.
The decision tree (highest_bit_unrolled) benchmarked 69% faster, except for the n = 0 case, for which the binary search has an explicit test.
In that n = 0 case, the binary search's special test is only 48% faster than the decision tree, which has no special test at all.
Compiler, machine: GCC 4.5.2, -O3, x86-64, 2867 MHz Intel Core i5.
int highest_bit_unrolled(long long n)
{
if (n & 0x7FFFFFFF00000000) {
if (n & 0x7FFF000000000000) {
if (n & 0x7F00000000000000) {
if (n & 0x7000000000000000) {
if (n & 0x4000000000000000)
return 63;
else
return (n & 0x2000000000000000) ? 62 : 61;
} else {
if (n & 0x0C00000000000000)
return (n & 0x0800000000000000) ? 60 : 59;
else
return (n & 0x0200000000000000) ? 58 : 57;
}
} else {
if (n & 0x00F0000000000000) {
if (n & 0x00C0000000000000)
return (n & 0x0080000000000000) ? 56 : 55;
else
return (n & 0x0020000000000000) ? 54 : 53;
} else {
if (n & 0x000C000000000000)
return (n & 0x0008000000000000) ? 52 : 51;
else
return (n & 0x0002000000000000) ? 50 : 49;
}
}
} else {
if (n & 0x0000FF0000000000) {
if (n & 0x0000F00000000000) {
if (n & 0x0000C00000000000)
return (n & 0x0000800000000000) ? 48 : 47;
else
return (n & 0x0000200000000000) ? 46 : 45;
} else {
if (n & 0x00000C0000000000)
return (n & 0x0000080000000000) ? 44 : 43;
else
return (n & 0x0000020000000000) ? 42 : 41;
}
} else {
if (n & 0x000000F000000000) {
if (n & 0x000000C000000000)
return (n & 0x0000008000000000) ? 40 : 39;
else
return (n & 0x0000002000000000) ? 38 : 37;
} else {
if (n & 0x0000000C00000000)
return (n & 0x0000000800000000) ? 36 : 35;
else
return (n & 0x0000000200000000) ? 34 : 33;
}
}
}
} else {
if (n & 0x00000000FFFF0000) {
if (n & 0x00000000FF000000) {
if (n & 0x00000000F0000000) {
if (n & 0x00000000C0000000)
return (n & 0x0000000080000000) ? 32 : 31;
else
return (n & 0x0000000020000000) ? 30 : 29;
} else {
if (n & 0x000000000C000000)
return (n & 0x0000000008000000) ? 28 : 27;
else
return (n & 0x0000000002000000) ? 26 : 25;
}
} else {
if (n & 0x0000000000F00000) {
if (n & 0x0000000000C00000)
return (n & 0x0000000000800000) ? 24 : 23;
else
return (n & 0x0000000000200000) ? 22 : 21;
} else {
if (n & 0x00000000000C0000)
return (n & 0x0000000000080000) ? 20 : 19;
else
return (n & 0x0000000000020000) ? 18 : 17;
}
}
} else {
if (n & 0x000000000000FF00) {
if (n & 0x000000000000F000) {
if (n & 0x000000000000C000)
return (n & 0x0000000000008000) ? 16 : 15;
else
return (n & 0x0000000000002000) ? 14 : 13;
} else {
if (n & 0x0000000000000C00)
return (n & 0x0000000000000800) ? 12 : 11;
else
return (n & 0x0000000000000200) ? 10 : 9;
}
} else {
if (n & 0x00000000000000F0) {
if (n & 0x00000000000000C0)
return (n & 0x0000000000000080) ? 8 : 7;
else
return (n & 0x0000000000000020) ? 6 : 5;
} else {
if (n & 0x000000000000000C)
return (n & 0x0000000000000008) ? 4 : 3;
else
return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
}
}
}
}
}
int highest_bit(long long n)
{
const long long mask[] = {
0x000000007FFFFFFF,
0x000000000000FFFF,
0x00000000000000FF,
0x000000000000000F,
0x0000000000000003,
0x0000000000000001
};
int hi = 64;
int lo = 0;
int i = 0;
if (n == 0)
return 0;
for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
int mi = lo + (hi - lo) / 2;
if ((n >> mi) != 0)
lo = mi;
else if ((n & (mask[i] << lo)) != 0)
hi = mi;
}
return lo + 1;
}
Quick and dirty test program:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int highest_bit_unrolled(long long n);
int highest_bit(long long n);
int main(int argc, char **argv)
{
long long n = strtoull(argv[1], NULL, 0);
int b1, b2;
long i;
clock_t start = clock(), mid, end;
for (i = 0; i < 1000000000; i++)
b1 = highest_bit_unrolled(n);
mid = clock();
for (i = 0; i < 1000000000; i++)
b2 = highest_bit(n);
end = clock();
printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);
printf("time1 = %d\n", (int) (mid - start));
printf("time2 = %d\n", (int) (end - mid));
return 0;
}
Using only -O2, the difference becomes greater. The decision tree is almost four times faster.
I also benchmarked against the naive bit shifting code:
int highest_bit_shift(long long n)
{
int i = 0;
for (; n; n >>= 1, i++)
; /* empty */
return i;
}
This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63-bit space have the 63rd bit set!
On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and turns out to be 1120% faster (12.2 times) than the bit shifter.
I will also benchmark the decision tree against the GCC builtins, and try a mixture of inputs rather than repeating the same number. Repeating one input may let branch prediction lock in, and it may create unrealistically favorable caching, which makes the code artificially fast on repetitions.
Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.
[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:
asm ("bsrl %1, %0"
: "=r" (position)
: "r" (number));
unsigned int
msb32(register unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return(x & ~(x >> 1));
}
1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which is linear-time (microcoded) on some CPUs; this runs in logarithmic time. Note that it returns the isolated top bit itself, i.e. the greatest power of two not exceeding x, rather than the bit's index.
From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit
What about
#include <cmath>
int highest_bit(unsigned int a) {
    int count;
    std::frexp(a, &count);
    return count - 1;
}
?
Here are some (simple) benchmarks of algorithms currently given on this page...
The algorithms have not been tested over all inputs of unsigned int, so check that first before blindly using something ;)
On my machine clz (__builtin_clz) and asm work best. asm seems even faster than clz... but it might be due to the simple benchmark...
//////// go.c ///////////////////////////////
// compile with: gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/***************** math ********************/
#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */ \
((unsigned) log2(a)) /* thus: do not use if a <= 0 */
#define NUM_OF_HIGHESTBITmath(a) ((a) \
? (1U << POS_OF_HIGHESTBITmath(a)) \
: 0)
/***************** clz ********************/
unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */
#define NUM_OF_HIGHESTBITclz(a) ((a) \
? (1U << POS_OF_HIGHESTBITclz(a)) \
: 0)
/***************** i2f ********************/
double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)((a)|1), ((*(1+(unsigned*)&FF))>>20)-1023)
#define NUM_OF_HIGHESTBITi2f(a) ((a) \
? (1U << POS_OF_HIGHESTBITi2f(a)) \
: 0)
/***************** asm ********************/
unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)
#define NUM_OF_HIGHESTBITasm(a) ((a) \
? (1U << POS_OF_HIGHESTBITasm(a)) \
: 0)
/***************** bitshift1 ********************/
#define NUM_OF_HIGHESTBITbitshift1(a) (({ \
OUT = a; \
OUT |= (OUT >> 1); \
OUT |= (OUT >> 2); \
OUT |= (OUT >> 4); \
OUT |= (OUT >> 8); \
OUT |= (OUT >> 16); \
}), (OUT & ~(OUT >> 1))) \
/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
#define POS_OF_HIGHESTBITbitshift2(a) (({ \
OUT = a; \
OUT |= OUT >> 1; \
OUT |= OUT >> 2; \
OUT |= OUT >> 4; \
OUT |= OUT >> 8; \
OUT |= OUT >> 16; \
OUT = (OUT >> 1) + 1; \
}), POS[(OUT * 0x077CB531UL) >> 27])
#define NUM_OF_HIGHESTBITbitshift2(a) ((a) \
? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
: 0)
#define LOOPS 100000000U
int main()
{
clock_t start, end;
unsigned ui;
unsigned n;
/********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
printf("math\n");
for (ui = 0U; ui < 18; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));
printf("\n\n");
printf("clz\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));
printf("\n\n");
printf("i2f\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));
printf("\n\n");
printf("asm\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
}
printf("\n\n");
printf("bitshift1\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
}
printf("\n\n");
printf("bitshift2\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
}
printf("\n\nPlease wait...\n\n");
/************************* Simple clock() benchmark ******************/
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITmath(ui);
end = clock();
printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITclz(ui);
end = clock();
printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITi2f(ui);
end = clock();
printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITasm(ui);
end = clock();
printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift1(ui);
end = clock();
printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift2(ui);
end = clock();
printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");
return EXIT_SUCCESS;
}
Some overly complex answers here. The De Bruijn technique should only be used when the input is already a power of two; otherwise there's a better way. For a power-of-2 input, De Bruijn is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPUs it can be microcoded, though).
If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.
#include <stdint.h>
typedef uint8_t  u8;    /* shorthand types used in this answer */
typedef uint32_t u32;

u8 inline log2 (u32 val) {
    u8 k = 0;
    if (val > 0x0000FFFFu) { val >>= 16; k  = 16; }
    if (val > 0x000000FFu) { val >>= 8;  k |= 8;  }
    if (val > 0x0000000Fu) { val >>= 4;  k |= 4;  }
    if (val > 0x00000003u) { val >>= 2;  k |= 2;  }
    k |= (val & 2) >> 1;
    return k;
}
Note that this version does not require a De Bruijn lookup at the end, unlike most of the other answers. It computes the position in place.
Tables can be preferable, though: if you call this often enough, the risk of a cache miss is eclipsed by the speedup of the table lookup.
u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};
u8 log2_table(u32 val) {
u8 k = 0;
if (val > 0x0000FFFFuL) { val >>= 16; k = 16; }
if (val > 0x000000FFuL) { val >>= 8; k |= 8; }
k |= kTableLog2[val]; // precompute the Log2 of the low byte
return k;
}
This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.
I had a need for a routine to do this and, before searching the web (and finding this page), I came up with my own solution based on a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.
int highest_bit(unsigned int a) {
static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
const unsigned int *mask = maskv;
int l, h;
if (a == 0) return -1;
l = 0;
h = 32;
do {
int m = l + (h - l) / 2;
if ((a >> m) != 0) l = m;
else if ((a & (*mask << l)) != 0) h = m;
mask++;
} while (l < h - 1);
return l;
}
A version in C using successive approximation:
unsigned int getMsb(unsigned int n)
{
unsigned int msb = sizeof(n) * 4;
unsigned int step = msb;
while (step > 1)
{
step /=2;
if (n>>msb)
msb += step;
else
msb -= step;
}
if (n>>msb)
msb++;
return (msb - 1);
}
Advantage: the running time is constant regardless of the provided number, as the number of loop iterations is always the same (4 iterations for a 32-bit unsigned int).
That's a kind of binary search; it works with all kinds of (unsigned!) integer types:
#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int msb(UINT x)
{
if(0 == x)
return -1;
int c = 0;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x >> i))
{
x >>= i;
c |= i;
}
return c;
}
To make it complete, here is the corresponding lsb():
#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int lsb(UINT x)
{
if(0 == x)
return -1;
int c = UINT_BIT-1;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x << i))
{
x <<= i;
c ^= i;
}
return c;
}
Expanding on Josh's benchmark...
one can improve the clz as follows
/***************** clz2 ********************/
#define NUM_OF_HIGHESTBITclz2(a) ((a) \
? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
: 0)
Regarding the asm: note that there are bsr and bsrl (this is the "long" version). The normal one might be a bit faster.
As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be specific to either 32-bit or 64-bit registers. The stanford.edu bithacks page provides solutions that work for both 32-bit and 64-bit computing. With a little work, they can be combined into a solid cross-architecture approach to obtaining the MSB. The solution I arrived at, which compiled and worked across 64- and 32-bit machines, was:
#if defined(__LP64__) || defined(_LP64)
# define BUILD_64 1
#endif
#include <stdio.h>
#include <stdint.h>  /* for uint32_t */
#include <endian.h>  /* glibc: provides __FLOAT_WORD_ORDER and LITTLE_ENDIAN used below */
/* CHAR_BIT (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT 8
#endif /* CHAR_BIT */
/*
* Find the log base 2 of an integer with the MSB N set in O(N)
* operations. (on 64bit & 32bit architectures)
*/
int
getmsb (uint32_t word)
{
int r = 0;
if (word < 1)
return 0;
#ifdef BUILD_64
union { uint32_t u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
while (word >>= 1)
{
r++;
}
#endif /* BUILD_64 */
return r;
}
I know this question is very old, but just having implemented an msb() function myself,
I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:
Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01.
See where I'm going with this? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look at first glance.
It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).
Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).
Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:
#include <limits>
#include <type_traits>

template <typename T>
auto msb(T n) -> int
{
    static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
                  "msb<T>(): T must be an unsigned integral type.");
    // Use a signed index (an unsigned one never drops below 0) and build the
    // mask in type T so the shift is well-defined for 64-bit types as well.
    for (int i = std::numeric_limits<T>::digits - 1; i >= 0; --i)
    {
        if ((n & (T(1) << i)) != 0)
            return i;
    }
    return 0;
}
Update: While what I wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater than or equal to an object size.
My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster when looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.
1 The argument goes like this (rough draft):
Let n be the number of bits (the bit-width). There are 2^n integers in total which can be represented with n bits. There are 2^(n-1) integers starting with a 1 (the first 1 is fixed, the remaining n-1 bits can be anything); those integers require only one iteration of the loop to determine the MSB. Further, there are 2^(n-2) integers starting with 01, requiring 2 iterations, 2^(n-3) integers starting with 001, requiring 3 iterations, and so on.
If we sum up all the required iterations for all possible integers and divide by 2^n, the total number of integers, we get the average number of iterations needed for determining the MSB of an n-bit integer:
(1 * 2^(n-1) + 2 * 2^(n-2) + 3 * 2^(n-3) + ... + n * 2^0) / 2^n
This series converges and has a limit of 2 as n goes towards infinity.
Thus, the naïve left-to-right algorithm actually has an amortized constant time complexity of O(1) for any number of bits.
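For anyone who wants to sanity-check that limit numerically, here is a tiny throwaway program (my addition, not part of the original argument; it simply evaluates the sum above for n = 32):
#include <stdio.h>

int main(void)
{
    const int n = 32;
    double avg = 0.0;
    /* k * 2^(n-k) / 2^n simplifies to k / 2^k */
    for (int k = 1; k <= n; ++k)
        avg += k / (double)(1ULL << k);
    printf("average iterations for %d bits: %f\n", n, avg);  /* prints ~2.0 */
    return 0;
}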
C99 has given us log2. This removes the need for all the special-sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:
const unsigned long n = 13UL;
const unsigned long Index = (unsigned long)log2(n);
printf("MSB is: %lu\n", Index); // Prints 3 (zero offset)
An n of 0UL needs to be guarded against as well, because:
-∞ is returned and FE_DIVBYZERO is raised
I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi
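A minimal sketch of such a guard (my reconstruction, not the code behind the link; ULONG_MAX is simply the sentinel mentioned above):
#include <limits.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    const unsigned long n = 0UL;    /* try 0 to exercise the guard */
    const unsigned long Index = (n != 0UL) ? (unsigned long)log2((double)n) : ULONG_MAX;
    if (Index == ULONG_MAX)
        printf("MSB is undefined for 0\n");
    else
        printf("MSB is: %lu\n", Index);
    return 0;
}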
The Visual Studio corollary to ephemient's GCC-only answer is:
const unsigned long n = 13UL;
unsigned long Index;
_BitScanReverse(&Index, n);
printf("MSB is: %lu\n", Index); // Prints 3 (zero offset)
The documentation for _BitScanReverse states that Index is:
Loaded with the bit position of the first set bit (1) found
In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:
0 if no set bits were found
Thus, as with the preferable log2 approach above, the return value should be checked and Index set to a flag value in that case. I've again written an example using ULONG_MAX for this flag value here: http://rextester.com/GCU61409
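The check might look like this (a sketch; _BitScanReverse lives in <intrin.h> on MSVC, and the helper name is mine):
#include <intrin.h>
#include <limits.h>

unsigned long msb_or_flag(unsigned long n)
{
    unsigned long Index;
    if (!_BitScanReverse(&Index, n))
        Index = ULONG_MAX;    /* sentinel for n == 0, as in the linked example */
    return Index;
}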
Think bitwise operators.
I misunderstood the question the first time. You should produce an int with only the leftmost bit set (the others zero), with cmp holding that value:
unsigned int cmp = 1u << (sizeof(int) * 8 - 1);   /* mask with only the top bit set */
int position = sizeof(int) * 8;
while (!(n & cmp)) {    /* caution: never terminates for n == 0 */
    n <<= 1;
    position--;
}
Wow, that's a lot of answers. I am not sorry for answering an old question.
int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
if(0xFFFFFFFF00000000&value){ value>>=(1<<5); result|=(1<<5); }//if it is 32bit then remove this line
if(0x00000000FFFF0000&value){ value>>=(1<<4); result|=(1<<4); }//and remove the 32msb
if(0x000000000000FF00&value){ value>>=(1<<3); result|=(1<<3); }
if(0x00000000000000F0&value){ value>>=(1<<2); result|=(1<<2); }
if(0x000000000000000C&value){ value>>=(1<<1); result|=(1<<1); }
if(0x0000000000000002&value){ result|=(1<<0); }
}else{
result=-1;
}
This answer is pretty similar to another answer... oh well.
Note that what you are trying to do is calculate the integer log2 of an integer.
#include <stdio.h>
#include <stdlib.h>
unsigned long
Log2(unsigned long x)
{
    unsigned long n = x;
    int bits = sizeof(x)*8;
    int step = 1;
    for( step = 1; step < bits; step *= 2 ) {
        n |= (n >> step);   /* smear the highest set bit downwards */
    }
    /* n is now all ones up to the MSB of x; n - (n >> 1) isolates that bit,
       i.e. the greatest power of two <= x (not the log index, despite the name) */
    //printf("%ld %ld\n",x, (n - (n >> 1)) );
    return(n - (n >> 1));
}
Observe that you can attempt to search more than 1 bit at a time.
unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
This approach uses a binary search
unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;
while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%ld)=r%d\n",x,guess);
return(guess);
}
Another binary search method, perhaps more readable,
unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%ld)=r%ld\n",x,res);
return(res);
}
And because you will want to test these,
int main()
{
unsigned long int x = 3;
for( x=2; x<1000000000; x*=2 ) {
//printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
}
return(0);
}
Putting this in since it's 'yet another' approach, seems to be different from others already given.
returns -1 if x==0, otherwise floor( log2(x)) (max result 31)
Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.
This is what I use when I don't want to use __builtin_clz because of portability issues.
To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations (a sketch of that variant follows the function below). Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.
int log2floor( unsigned x ){
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
int r = 0;
unsigned xk = x >> 16;
if( xk != 0 ){
r = 16;
x = xk;
}
// x is 0 .. 0xFFFF
xk = x >> 8;
if( xk != 0){
r += 8;
x = xk;
}
// x is 0 .. 0xFF
xk = x >> 4;
if( xk != 0){
r += 4;
x = xk;
}
// now x is 0..15; x=0 only if originally zero.
return r + wtab[x];
}
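A sketch of the more compact loop-reduction variant mentioned above (shift by 4 bits per iteration, at most 7 iterations for 32 bits, then finish with the same table; the name log2floor_loop is mine):
int log2floor_loop( unsigned x ){
    static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
    int r = 0;
    while( x >> 4 ){    // reduce until x fits in 4 bits
        x >>= 4;
        r += 4;
    }
    return r + wtab[x];  // -1 only if the input was zero
}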
Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.
The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, it could be allocated as unmanaged memory via Marshal.AllocHGlobal; in the snippet shown here it is simply a managed byte[] that is initialized once and then shared read-only:
readonly static byte[] msb_tab_15;
// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
var p = new byte[0x8000];
for (byte n = 0; n < 16; n++)
for (int c = (1 << n) >> 1, i = 0; i < c; i++)
p[c + i] = n;
msb_tab_15 = p;
}
The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).
Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:
ulong (64-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 0x40) - 1; // handles cases v==0 and MSB==63
int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
uint (32-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
if ((int)v <= 0)
return (int)((v >> 26) & 0x20) - 1; // handles cases v==0 and MSB==31
int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
Various overloads for the above
public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];
This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.
That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.
The version provided above, coded as Tab16A, was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.
1 candidates.HighestOne_Tab16A 622,496
2 candidates.HighestOne_Tab16C 628,234
3 candidates.HighestOne_Tab8A 649,146
4 candidates.HighestOne_Tab8B 656,847
5 candidates.HighestOne_Tab16B 657,147
6 candidates.HighestOne_Tab16D 659,650
7 _highest_one_bit_UNMANAGED.HighestOne_U 702,900
8 de_Bruijn.IndexOfMSB 709,672
9 _old_2.HighestOne_Old2 715,810
10 _test_A.HighestOne8 757,188
11 _old_1.HighestOne_Old1 757,925
12 _test_A.HighestOne5 (unsafe) 760,387
13 _test_B.HighestOne8 (unsafe) 763,904
14 _test_A.HighestOne3 (unsafe) 766,433
15 _test_A.HighestOne1 (unsafe) 767,321
16 _test_A.HighestOne4 (unsafe) 771,702
17 _test_B.HighestOne2 (unsafe) 772,136
18 _test_B.HighestOne1 (unsafe) 772,527
19 _test_B.HighestOne3 (unsafe) 774,140
20 _test_A.HighestOne7 (unsafe) 774,581
21 _test_B.HighestOne7 (unsafe) 775,463
22 _test_A.HighestOne2 (unsafe) 776,865
23 candidates.HighestOne_NoTab 777,698
24 _test_B.HighestOne6 (unsafe) 779,481
25 _test_A.HighestOne6 (unsafe) 781,553
26 _test_B.HighestOne4 (unsafe) 785,504
27 _test_B.HighestOne5 (unsafe) 789,797
28 _test_A.HighestOne0 (unsafe) 809,566
29 _test_B.HighestOne0 (unsafe) 814,990
30 _highest_one_bit.HighestOne 824,345
30 _bitarray_ext.RtlFindMostSignificantBit 894,069
31 candidates.HighestOne_Naive 898,865
Notable is the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:
[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);
It's really too bad, because here's the entire actual function:
RtlFindMostSignificantBit:
bsr rdx, rcx
mov eax,0FFFFFFFFh
movzx ecx, dl
cmovne eax,ecx
ret
I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:
public static int HighestOne_Tab8A(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
int j;
j = /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
return j + msb_tab_8[v >> j];
}
The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:
const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
N_bsr64 = 0x03F79D71B4CB0A89;
readonly public static sbyte[]
bsf64 =
{
63, 0, 58, 1, 59, 47, 53, 2, 60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15, 8, 23, 7, 6, 5,
},
bsr64 =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63,
};
public static int IndexOfLSB(ulong v) =>
v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;
public static int IndexOfMSB(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
v |= v >> 1; v |= v >> 2; v |= v >> 4; // does anybody know a better
v |= v >> 8; v |= v >> 16; v |= v >> 32; // way than these 12 ops?
return bsr64[(v * N_bsr64) >> 58];
}
There's much discussion at this SO question of how superior the deBruijn methods are, and I had tended to agree. My speculation is that, while both the deBruijn and the direct lookup-table methods (which I found to be fastest) have to do a table lookup and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here, not the deBruijn IndexOfLSB, but I expect the latter to fare much better, since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.
I assume your question is for an integer (called v below) and not an unsigned integer.
int v = 612635685; // whatever value you wish
int get_msb(int v)
{
    /* maximum number of iterations until the integer has been completely
       shifted out; bit indexing starts at 0. Use (sizeof(int) * 8 - 1)
       instead of 31 to make it work on any platform. */
    int r = 31;
    while (!(v & 0x80000000) && r--) { /* mask of the highest bit */
        v <<= 1;                       /* multiply the integer by 2 */
    }
    return r; /* will even return -1 if no bit was set, allowing an error check */
}
If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly).
Please let me know if I forgot anything. I haven't tested it but it should work just fine.
This looks big but works really fast compared to a loop. Thanks to bluegsmith.
int Bit_Find_MSB_Fast(int x2)
{
long x = x2 & 0x0FFFFFFFFl;
long num_even = x & 0xAAAAAAAA;
long num_odds = x & 0x55555555;
if (x == 0) return(0);
if (num_even > num_odds)
{
if ((num_even & 0xFFFF0000) != 0) // top 4
{
if ((num_even & 0xFF000000) != 0)
{
if ((num_even & 0xF0000000) != 0)
{
if ((num_even & 0x80000000) != 0) return(32);
else
return(30);
}
else
{
if ((num_even & 0x08000000) != 0) return(28);
else
return(26);
}
}
else
{
if ((num_even & 0x00F00000) != 0)
{
if ((num_even & 0x00800000) != 0) return(24);
else
return(22);
}
else
{
if ((num_even & 0x00080000) != 0) return(20);
else
return(18);
}
}
}
else
{
if ((num_even & 0x0000FF00) != 0)
{
if ((num_even & 0x0000F000) != 0)
{
if ((num_even & 0x00008000) != 0) return(16);
else
return(14);
}
else
{
if ((num_even & 0x00000800) != 0) return(12);
else
return(10);
}
}
else
{
if ((num_even & 0x000000F0) != 0)
{
if ((num_even & 0x00000080) != 0)return(8);
else
return(6);
}
else
{
if ((num_even & 0x00000008) != 0) return(4);
else
return(2);
}
}
}
}
else
{
if ((num_odds & 0xFFFF0000) != 0) // top 4
{
if ((num_odds & 0xFF000000) != 0)
{
if ((num_odds & 0xF0000000) != 0)
{
if ((num_odds & 0x40000000) != 0) return(31);
else
return(29);
}
else
{
if ((num_odds & 0x04000000) != 0) return(27);
else
return(25);
}
}
else
{
if ((num_odds & 0x00F00000) != 0)
{
if ((num_odds & 0x00400000) != 0) return(23);
else
return(21);
}
else
{
if ((num_odds & 0x00040000) != 0) return(19);
else
return(17);
}
}
}
else
{
if ((num_odds & 0x0000FF00) != 0)
{
if ((num_odds & 0x0000F000) != 0)
{
if ((num_odds & 0x00004000) != 0) return(15);
else
return(13);
}
else
{
if ((num_odds & 0x00000400) != 0) return(11);
else
return(9);
}
}
else
{
if ((num_odds & 0x000000F0) != 0)
{
if ((num_odds & 0x00000040) != 0)return(7);
else
return(5);
}
else
{
if ((num_odds & 0x00000004) != 0) return(3);
else
return(1);
}
}
}
}
}
There's a proposal to add bit-manipulation functions to C; specifically, counting leading zeros is helpful to find the highest set bit. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2827.htm#design-bit-leading.trailing.zeroes.ones
They are expected to be implemented as built-ins where possible, so this should be an efficient way.
This is similar to what was recently added to C++ (std::countl_zero, etc).
The code:
#include <cstdio>

// x >= 1; reads the IEEE-754 exponent field after converting to double
// (same aliasing caveats as discussed above)
int func(unsigned x) {
    double d = x;
    int p = (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
    printf("The left-most non-zero bit of %u is bit %d\n", x, p);
    return p;
}
Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1
My humble method is very simple:
MSB(x) = INT[Log(x) / Log(2)]
Translation: the MSB of x is the integer part of (the log of x divided by the log of 2).
This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.
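As a sketch, here is a direct transcription in C (the function name is mine); note that floating-point rounding can be off by one near exact powers of two, so verify before relying on it in anything critical:
#include <math.h>

int msb_log(unsigned int x)    /* x >= 1 */
{
    return (int)(log((double)x) / log(2.0));
}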
Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.
#include <limits.h>
unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
unsigned long flsl(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
unsigned long long flsll(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
And a slightly improved version for C++:
#include <climits>
constexpr unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
constexpr unsigned long fls(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
constexpr unsigned long long fls(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
The code assumes that value won't be 0. If you want to allow 0, you need to modify it.
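As a possible usage note for the original goal, the greatest power of two strictly below x, one could call it on x - 1 (assuming x >= 2 so the argument is non-zero; the wrapper name is mine):
unsigned int greatest_pow2_below(unsigned int x)
{
    return fls(x - 1);    /* fls already returns the power itself, not the index */
}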
Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.
Without direct hardware support, it seemed like it should be possible to do better than O(log(w)) for w = 64 bits. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w >= 256 bits.
Either way, I gave it a go and the best I could come up with was the following mix of techniques:
uint64_t msb64 (uint64_t n) {
const uint64_t M1 = 0x1111111111111111;
// we need to clear blocks of b=4 bits: log(w/b) >= b
n |= (n>>1); n |= (n>>2);
// reverse prefix scan, compiles to 1 mulx
uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;
// parallel-reduce each block
s |= (s>>1); s |= (s>>2);
// parallel reduce, 1 imul
uint64_t c = (s&M1)*(M1<<4);
// collect the last nibble, compute count - count%4
c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
c &= (0x0F<<2); // zero the lowest 2 bits
// add the missing bits; this could be better solved with a bit of foresight
// by having the sum already stored
uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb
const uint64_t S = 0x3333333322221100; // last should give -1ul
return c | ((S>>(4*b)) & 0x03);
}
This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue in modern x86-64 architectures.
I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere.
Finding a consistent timing and ranking proved to be way harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution and other CPU shenanigans, which can sometimes overlap the computation of two or more iterations of a loop.
I ran the tests on an AMD Zen using RDTSC and taking a number of precautions such as running a warm-up, introducing artificial chain dependencies, and so on.
For a 64-bit pseudorandom even distribution the results are:
name        cycles   comment
clz           5.16   builtin intrinsic, fastest
cast          5.18   cast to double, extract exp
ulog2         7.50   reduction + deBruijn
msb64*       11.26   this version
unrolled     19.12   varying performance
obvious     110.49   "obviously" slowest for int64
Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.
My method is around 50% slower than deBruijn, but has the advantage of using no extra memory and having predictable performance. I might try to further optimize it if I ever have time.
