Square root source code found on the net - c

I'm looking for some algorithm for square root calculation and found this source file. I would like to try to replicate it because it seems quite simple but I can not relate it to some known algorithm (Newton, Babylon ...). Can you tell me the name?
int sqrt(int num) {
int op = num;
int res = 0;
int one = 1 << 30; // The second-to-top bit is set: 1L<<30 for long
// "one" starts at the highest power of four <= the argument.
while (one > op)
one >>= 2;
while (one != 0) {
if (op >= res + one) {
op -= res + one;
res += 2 * one;
}
res >>= 1;
one >>= 2;
}
return res;
}

As #Eugene Sh. references, this is the classic "digit-by-digit" method done to calculate square root. Learned it in base 10 when such things were taught in primary school.
OP's code fails select numbers too. sqrt(1073741824) --> -1 rather than expected 32768. 1073741824 == 0x40000000. Further, it fails most (all?) values this and greater. Of course OP's sqrt(some_negative) is a problem too.
Candidate alternative: also here
unsigned isqrt(unsigned num) {
unsigned res = 0;
// The second-to-top bit is set: 1 << 30 for 32 bits
// Needs work to run on unusual platforms where `unsigned` has padding or odd bit width.
unsigned bit = 1u << (sizeof(num) * CHAR_BIT - 2);
// "bit" starts at the highest power of four <= the argument.
while (bit > num) {
bit >>= 2;
}
while (bit > 0) {
if (num >= res + bit) {
num -= res + bit;
res = (res >> 1) + bit; // Key difference between this and OP's code
} else {
res >>= 1;
}
bit >>= 2;
}
return res;
}
Portability update. The greatest power of 4 is needed.
#include <limits.h>
// greatest power of 4 <= a power-of-2 minus 1
#define POW4_LE_POW2M1(n) ( ((n)/2 + 1) >> ((n)%3==0) )
unsigned bit = POW4_LE_POW2M1(UINT_MAX);

Related

random 4 digit number with non repeating digits in C

I'm trying to make a get_random_4digit function that generates a 4 digit number that has non-repeating digits ranging from 1-9 while only using ints, if, while and functions, so no arrays etc.
This is the code I have but it is not really working as intended, could anyone point me in the right direction?
int get_random_4digit() {
int d1 = rand() % 9 + 1;
int d2 = rand() % 9 + 1;
while (true) {
if (d1 != d2) {
int d3 = rand() % 9 + 1;
if (d3 != d1 || d3 != d2) {
int d4 = rand() % 9 + 1;
if (d4 != d1 || d4 != d2 || d4 != d3) {
random_4digit = (d1 * 1000) + (d2 * 100) + (d3 * 10) + d4;
break;
}
}
}
}
printf("Random 4digit = %d\n", random_4digit);
}
A KISS-approach could be this:
int getRandom4Digits() {
uint16_t acc = 0;
uint16_t used = 0;
for (int i = 0; i < 4; i++) {
int idx;
do {
idx = rand() % 9; // Not equidistributed but never mind...
} while (used & (1 << idx));
acc = acc * 10 + (idx + 1);
used |= (1 << idx);
}
return acc;
}
This looks terribly dumb at first. A quick analysis gives that this really isn't so bad, giving a number of calls to rand() to be about 4.9.
The expected number of inner loop steps [and corresponding calls to rand(), if we assume rand() % 9 to be i.i.d.] will be:
9/9 + 9/8 + 9/7 + 9/6 ~ 4.9107.
There are 9 possibilities for the first digit, 8 possibilities for the second digit, 7 possibilities for the third digit and 6 possibilities for the last digit. This works out to "9*8*7*6 = 3024 permutations".
Start by getting a random number from 0 to 3023. Let's call that P. To do this without causing a biased distribution use something like do { P = rand() & 0xFFF; } while(P >= 3024);.
Note: If you don't care about uniform distribution you could just do P = rand() % 3024;. In this case lower values of P will be more likely because RAND_MAX doesn't divide by 3024 nicely.
The first digit has 9 possibilities, so do d1 = P % 9 + 1; P = P / 9;.
The second digit has 8 possibilities, so do d2 = P % 8 + 1; P = P / 8;.
The third digit has 7 possibilities, so do d3 = P % 7 + 1; P = P / 7;.
For the last digit you can just do d4 = P + 1; because we know P can't be too high.
Next; convert "possibility" into a digit. For d1 you do nothing. For d2 you need to increase it if it's greater than or equal to d1, like if(d2 >= d1) d2++;. Do the same for d3 and d4 (comparing against all previous digits).
The final code will be something like:
int get_random_4digit() {
int P, d1, d2, d3, d4;
do {
P = rand() & 0xFFF;
} while(P >= 3024);
d1 = P % 9 + 1; P = P / 9;
d2 = P % 8 + 1; P = P / 8;
d3 = P % 7 + 1; P = P / 7;
d4 = P + 1;
if(d2 >= d1) d2++;
if(d3 >= d1) d3++;
if(d3 >= d2) d3++;
if(d4 >= d1) d4++;
if(d4 >= d2) d4++;
if(d4 >= d3) d4++;
return d1*1000 + d2*100 + d3*10 + d4;
}
You could start with an integer number, 0x123456789, and pick random nibbles from it (the 4 bits that makes up one of the digits in the hex value). When a nibble has been selected, remove it from the number and continue picking from those left.
This makes exactly 4 calls to rand() and has no if or other conditions (other than the loop condition).
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int get_random_4digit() {
uint64_t bits = 0x123456789; // nibbles
int res = 0;
// pick random nibbles
for(unsigned last = 9 - 1; last > 9 - 1 - 4; --last) {
unsigned lsh = last * 4; // shift last nibble
unsigned sel = (rand() % (last + 1)) * 4; // shift for random nibble
// multiply with 10 and add the selected nibble
res = res * 10 + ((bits & (0xFULL << sel)) >> sel);
// move the last unselected nibble right to where the selected
// nibble was:
bits = (bits & ~(0xFULL << sel)) |
((bits & (0xFULL << lsh)) >> (lsh - sel));
}
return res;
}
Demo
Another variant could be to use the same value, 0x123456789, and do a Fisher-Yates shuffle on the nibbles. When the shuffle is done, return the 4 lowest nibbles. This is more expensive since it randomizes the order of all 9 nibbles - but it makes it easy if you want to select an arbitrary amount of them afterwards.
Example:
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
uint16_t get_random_4digit() {
uint64_t bits = 0x123456789; // nibbles
// shuffle the nibbles
for(unsigned idx = 9 - 1; idx > 0; --idx) {
unsigned ish = idx * 4; // index shift
// shift for random nibble to swap with `idx`
unsigned swp = (rand() % (idx + 1)) * 4;
// extract the selected nibbles
uint64_t a = (bits & (0xFULL << ish)) >> ish;
uint64_t b = (bits & (0xFULL << swp)) >> swp;
// swap them
bits &= ~((0xFULL << ish) | (0xFULL << swp));
bits |= (a << swp) | (b << ish);
}
return bits & 0xFFFF; // return the 4 lowest nibbles
}
The bit manipulation can probably be optimized - but I wrote it like I thought it so it's probably better for readability to leave it as-is
You can then print the value as a hex value to get the output you want - or extract the 4 nibbles and convert it for decimal output.
int main() {
srand(time(NULL));
uint16_t res = get_random_4digit();
// print directly as hex:
printf("%X\n", res);
// or extract the nibbles and multiply to get decimal result - same output:
uint16_t a = (res >> 12) & 0xF;
uint16_t b = (res >> 8) & 0xF;
uint16_t c = (res >> 4) & 0xF;
uint16_t d = (res >> 0) & 0xF;
uint16_t dec = a * 1000 + b * 100 + c * 10 + d;
printf("%d\n", dec);
}
Demo
You should keep generating digits until distinct one found:
int get_random_4digit() {
int random_4digit = 0;
/* We must have 4 digits number - at least 1234 */
while (random_4digit < 1000) {
int digit = rand() % 9 + 1;
/* check if generated digit is not in the result */
for (int number = random_4digit; number > 0; number /= 10)
if (number % 10 == digit) {
digit = 0; /* digit has been found, we'll try once more */
break;
}
if (digit > 0) /* unique digit generated, we add it to result */
random_4digit = random_4digit * 10 + digit;
}
return random_4digit;
}
Please, fiddle youself
One way to do this is to create an array with all 9 digits, pick a random one and remove it from the list.
Something like this:
uint_fast8_t digits[]={1,2,3,4,5,6,7,8,9}; //only 1-9 are allowed, 0 is not allowed
uint_fast8_t left=4; //How many digits are left to create
unsigned result=0; //Store the 4-digit number here
while(left--)
{
uint_fast8_t digit=getRand(9-4+left); //pick a random index
result=result*10+digits[digit];
//Move all digits above the selcted one 1 index down.
//This removes the picked digit from the array.
while(digit<8)
{
digits[digit]=digits[digit+1];
digit++;
}
}
You said you need a solution without arrays. Luckily, we can store up to 16 4 bit numbers in a single uint64_t. Here is an example that uses a uint64_t to store the digit list so that no array is needed.
#include <stdint.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
unsigned getRand(unsigned max)
{
return rand()%(max+1);
}
//Creates a uint64_t that is used as an array.
//Use no more than 16 values and don't use values >0x0F
//The last argument will have index 0
uint64_t fakeArrayCreate(uint_fast8_t count, ...)
{
uint64_t result=0;
va_list args;
va_start (args, count);
while(count--)
{
result=(result<<4) | va_arg(args,int);
}
return result;
}
uint_fast8_t fakeArrayGet(uint64_t array, uint_fast8_t index)
{
return array>>(4*index)&0x0F;
}
uint64_t fakeArraySet(uint64_t array, uint_fast8_t index, uint_fast8_t value)
{
array = array & ~((uint64_t)0x0F<<(4*index));
array = array | ((uint64_t)value<<(4*index));
return array;
}
unsigned getRandomDigits(void)
{
uint64_t digits = fakeArrayCreate(9,9,8,7,6,5,4,3,2,1);
uint_fast8_t left=4;
unsigned result=0;
while(left--)
{
uint_fast8_t digit=getRand(9-4+left);
result=result*10+fakeArrayGet(digits,digit);
//Move all digits above the selcted one 1 index down.
//This removes the picked digit from the array.
while(digit<8)
{
digits=fakeArraySet(digits,digit,fakeArrayGet(digits,digit+1));
digit++;
}
}
return result;
}
//Test our function
int main(int argc, char **argv)
{
srand(atoi(argv[1]));
printf("%u\n",getRandomDigits());
}
You could use a partial Fisher-Yates shuffle on an array of 9 digits, stopping after 4 digits:
// Return random integer from 0 to n-1
// (for n in range 1 to RAND_MAX+1u).
int get_random_int(unsigned int n) {
unsigned int x = (RAND_MAX + 1u) / n;
unsigned int limit = x * n;
int s;
do {
s = rand();
} while (s >= limit);
return s / x;
}
// Return random 4-digit number from 1234 to 9876 with no
// duplicate digits and no 0 digit.
int get_random_4digit(void) {
char possible[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
int result = 0;
int i;
// Uses partial Fisher-Yates shuffle.
for (i = 0; i < 4; i++) {
// Get random position rand_pos from remaining possibilities i to 8
// (positions before i contain previous selected digits).
int rand_pos = i + get_random_int(9 - i);
// Select digit from position rand_pos.
char digit = possible[rand_pos];
// Exchange digits at positions i and rand_pos.
possible[rand_pos] = possible[i];
possible[i] = digit; // not really needed
// Put selected digit into result.
result = result * 10 + digit;
}
return result;
}
EDIT: I forgot the requirement "while only using int's, if, while and functions, so no arrays etc.", so feel free to ignore this answer!
If normal C integer types are allowed including long long int, the get_random_4digit() function above can be replaced with the following to satisfy the requirement:
// Return random 4-digit number from 1234 to 9876 with no
// duplicate digits and no 0 digit.
int get_random_4digit(void) {
long long int possible = 0x123456789; // 4 bits per digit
int result = 0;
int i;
// Uses partial Fisher-Yates shuffle.
i = 0;
while (i < 4) {
// Determine random position rand_pos in remaining possibilities 0 to 8-i.
int rand_pos = get_random_int(9 - i);
// Select digit from position rand_pos.
int digit = (possible >> (4 * rand_pos)) & 0xF;
// Replace digit at position rand_pos with digit at position 0.
possible ^= ((possible ^ digit) & 0xF) << (4 * rand_pos);
// Shift remaining possible digits down one position.
possible >>= 4;
// Put selected digit into result.
result = result * 10 + digit;
i++;
}
return result;
}
There are multiple answers to this question already, but none of them seem to fit the requirement only using ints, if, while and functions. Here is a modified version of Pelle Evensen's simple solution:
#include <stdlib.h>
int get_random_4digit(void) {
int acc = 0, used = 0, i = 0;
while (i < 4) {
int idx = rand() % 9; // Not strictly uniform but never mind...
if (!(used & (1 << idx))) {
acc = acc * 10 + idx + 1;
used |= 1 << idx;
i++;
}
}
return acc;
}

64 bit / 64 bit remainder finding algorithm on a 32 bit processor?

I know that similar questions has been asked in the past, but I have implemented after a long process the algorithm to find the quotient correctly using the division by repeated subtraction method. But I am not able to find out the remainder from this approach. Is there any quick and easy way for finding out remainder in 64bit/64bit division on 32bit processor. To be more precise I am trying to implement
ulldiv_t __aeabi_uldivmod(
unsigned long long n, unsigned long long d)
Referenced in this document http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf
What? If you do repeated subtraction (which sounds really basic), then isn't it as simple as whatever you have left when you can't do another subtraction is the remainder?
At least that's the naïve intuitive way:
uint64_t simple_divmod(uint64_t n, uint64_t d)
{
if (n == 0 || d == 0)
return 0;
uint64_t q = 0;
while (n >= d)
{
++q;
n -= d;
}
return n;
}
Or am I missing the boat, here?
Of course this will be fantastically slow for large numbers, but this is repeated subtraction. I'm sure (even without looking!) there are more advanced algorithms.
This is a division algorithm, run in O(log(n/d))
uint64_t slow_division(uint64_t n, uint64_t d)
{
uint64_t i = d;
uint64_t q = 0;
uint64_t r = n;
while (n > i && (i >> 63) == 0) i <<= 1;
while (i >= d) {
q <<= 1;
if (r >= i) { r -= i; q += 1; }
i >>= 1;
}
// quotient is q, remainder is r
return q; // return r
}
q (quotient) can be removed if you need only r (remainder). You can implement each of the intermediate variables i,q,r as a pair of uint32_t, e.g. i_lo, i_hi, q_lo, q_hi ..... shift, add and subtract lo and hi are simple operations.
#define left_shift1 (a_hi, a_lo) // a <<= 1
{
a_hi = (a_hi << 1) | (a_lo >> 31)
a_lo = (a_lo << 1)
}
#define subtraction (a_hi, a_lo, b_hi, b_lo) // a-= b
{
uint32_t t = a_lo
a_lo -= b_lo
t = (a_lo > t) // borrow
a_hi -= b_hi + t
}
#define right_shift63 (a_hi, a_lo) // a >> 63
{
a_lo = a_hi >> 31;
a_hi = 0;
}
and so on.
0 as divisor is still an unresolved challenge :-) .

Set n highest bits twiddling

I have the following function that sets the N highest bits, e.g. set_n_high(8) == 0xff00000000000000
uint64_t set_n_high(int n)
{
uint64_t v = 0;
int i;
for (i = 63 ; i > 63 - n; i--) {
v |= (1llu << i);
}
return v;
}
Now just out of curiosity, is there any way in C to accomplish the same without using a loop (or a lookup table) ?
EDIT: n = 0 and n = 64 are cases to be handled, just as the loop variant does.
If you're OK with the n = 0 case not working, you can simplify it to
uint64_t set_n_high(int n)
{
return ~UINT64_C(0) << (64 - n);
}
If, in addition to that, you're OK with "weird shift counts" (undefined behaviour, but Works On My Machine), you can simplify that even further to
uint64_t set_n_high(int n)
{
return ~UINT64_C(0) << -n;
}
If you're OK with the n = 64 case not working, you can simplify it to
uint64_t set_n_high(int n)
{
return ~(~UINT64_C(0) >> n);
}
If using this means that you have to validate n, it won't be faster. Otherwise, it might be.
If you're not OK with either case not working, it gets trickier. Here's a suggestion (there may be a better way)
uint64_t set_n_high(int n)
{
return ~(~UINT64_C(0) >> (n & 63)) | -(uint64_t)(n >> 6);
}
Note that negating an unsigned number is perfectly well-defined.
uint64_t set_n_high(int n) {
return ((1llu << n) - 1) << (64-n);
}
Use a conditional to handle n == 0 and then it becomes trivial.
uint64_t set_n_high(int n) {
/* optional error checking:
if (n < 0 || n > 64) do something */
if (n == 0) return 0;
return -(uint64_t)1 << 64 - n;
}
There’s really no good reason to do anything more complicated than that. The cast from int to uint64_t is fully specified, as are the negation and shift (because the shift amount is guaranteed to lie in [0,63] if n is in [0,64]).
well taking #harold's answer and changing it a little:
uint64_t set_n_high(int n)
{
int carry = n>>6;
return ~((~0uLL >> (n-carry)) >> carry);
}
For what it's worth, of the posts so far (that handle n of 0-64), this one produces the least amount of assembly on an x86_64 and a raspberry pi (and does 1 branch operation) (with gcc 4.8.2). It looks fairly readable too.
uint64_t set_n_high2(int n)
{
uint64_t v = 0;
if (n != 0) {
v = ~UINT64_C(0) << (64 - n);
}
return v;
}
Well I'm presenting a weird-looking one.
:)
/* works for 0<=n<=64 */
uint64_t set_n_high(int n)
{
return ~0llu << ((64 - n) / 4) << ((64 - n) * 3 / 4);
}

How to manually (bitwise) perform (float)x?

Now, here is the function header of the function I'm supposed to implement:
/*
* float_from_int - Return bit-level equivalent of expression (float) x
* Result is returned as unsigned int, but
* it is to be interpreted as the bit-level representation of a
* single-precision floating point values.
* Legal ops: Any integer/unsigned operations incl. ||, &&. also if, while
* Max ops: 30
* Rating: 4
*/
unsigned float_from_int(int x) {
...
}
We aren't allowed to do float operations, or any kind of casting.
Now I tried to implement the first algorithm given at this site: http://locklessinc.com/articles/i2f/
Here's my code:
unsigned float_from_int(int x) {
// grab sign bit
int xIsNegative = 0;
int absValOfX = x;
if(x < 0){
xIsNegative = 1;
absValOfX = -x;
}
// zero case
if(x == 0){
return 0;
}
if(x == 0x80000000){ //Updated to add this
return 0xcf000000;
}
//int shiftsNeeded = 0;
/*while(){
shiftsNeeded++;
}*/
unsigned I2F_MAX_BITS = 15;
unsigned I2F_MAX_INPUT = ((1 << I2F_MAX_BITS) - 1);
unsigned I2F_SHIFT = (24 - I2F_MAX_BITS);
unsigned result, i, exponent, fraction;
if ((absValOfX & I2F_MAX_INPUT) == 0)
result = 0;
else {
exponent = 126 + I2F_MAX_BITS;
fraction = (absValOfX & I2F_MAX_INPUT) << I2F_SHIFT;
i = 0;
while(i < I2F_MAX_BITS) {
if (fraction & 0x800000)
break;
else {
fraction = fraction << 1;
exponent = exponent - 1;
}
i++;
}
result = (xIsNegative << 31) | exponent << 23 | (fraction & 0x7fffff);
}
return result;
}
But it didn't work (see test error below):
ERROR: Test float_from_int(8388608[0x800000]) failed...
...Gives 0[0x0]. Should be 1258291200[0x4b000000]
I don't know where to go from here. How should I go about parsing the float from this int?
EDIT #1:
You might be able to see from my code that I also started working on this algorithm (see this site):
I assumed 10-bit, 2’s complement, integers since the mantissa is only
9 bits, but the process generalizes to more bits.
Save the sign bit of the input and take the absolute value of the input.
Shift the input left until the high order bit is set and count the number of shifts required. This forms the floating mantissa.
Form the floating exponent by subtracting the number of shifts from step 2 from the constant 137 or (0h89-(#of shifts)).
Assemble the float from the sign, mantissa, and exponent.
But, that doesn't seem right. How could I convert 0x80000000? Doesn't make sense.
EDIT #2:
I think it's because I say max bits is 15... hmmm...
EDIT #3: Screw that old algorithm, I'm starting over:
unsigned float_from_int(int x) {
// grab sign bit
int xIsNegative = 0;
int absValOfX = x;
if(x < 0){
xIsNegative = 1;
absValOfX = -x;
}
// zero case
if(x == 0){
return 0;
}
if (x == 0x80000000){
return 0xcf000000;
}
int shiftsNeeded = 0;
int counter = 0;
while(((absValOfX >> counter) & 1) != 1 && shiftsNeeded < 32){
counter++;
shiftsNeeded++;
}
unsigned exponent = shiftsNeeded + 127;
unsigned result = (xIsNegative << 31) | (exponent << 23);
return result;
Here's the error I get on this one (I think I got past the last error):
ERROR: Test float_from_int(-2139095040[0x80800000]) failed...
...Gives -889192448[0xcb000000]. Should be -822149120[0xceff0000]
May be helpful to know that:
absValOfX = 7f800000
(using printf)
EDIT #4: Ah, I'm finding the exponent wrong, need to count from the left, then subtract from 32 I believe.
EDIT #5: I started over, now trying to deal with weird rounding problems...
if (x == 0){
return 0; // 0 is a special case because it has no 1 bits
}
if (x >= 0x80000000 && x <= 0x80000040){
return 0xcf000000;
}
// Save the sign bit of the input and take the absolute value of the input.
unsigned signBit = 0;
unsigned absX = (unsigned)x;
if (x < 0)
{
signBit = 0x80000000u;
absX = (unsigned)-x;
}
// Shift the input left until the high order bit is set to form the mantissa.
// Form the floating exponent by subtracting the number of shifts from 158.
unsigned exponent = 158;
while ((absX & 0x80000000) == 0)
{
exponent--;
absX <<= 1;
}
unsigned negativeRoundUp = (absX >> 7) & 1 & (absX >> 8);
// compute mantissa
unsigned mantissa = (absX >> 8) + ((negativeRoundUp) || (!signBit & (absX >> 7) & (exponent < 156)));
printf("absX = %x, absX >> 8 = %x, exponent = %i, mantissa = %x\n", absX, (absX >> 8), exponent, mantissa);
// Assemble the float from the sign, mantissa, and exponent.
return signBit | ((exponent << 23) + (signBit & negativeRoundUp)) | ( (mantissa) & 0x7fffff);
-
absX = fe000084, absX >> 8 = fe0000, exponent = 156, mantissa = fe0000
ERROR: Test float_from_int(1065353249[0x3f800021]) failed...
...Gives 1316880384[0x4e7e0000]. Should be 1316880385[0x4e7e0001]
EDIT #6
Did it again, still, the rounding doesn't work properly. I've tried to hack together some rounding, but it just won't work...
unsigned float_from_int(int x) {
/*
If N is negative, negate it in two's complement. Set the high bit (2^31) of the result.
If N < 2^23, left shift it (multiply by 2) until it is greater or equal to.
If N ≥ 2^24, right shift it (unsigned divide by 2) until it is less.
Bitwise AND with ~2^23 (one's complement).
If it was less, subtract the number of left shifts from 150 (127+23).
If it was more, add the number of right shifts to 150.
This new number is the exponent. Left shift it by 23 and add it to the number from step 3.
*/
printf("---------------\n");
//printf("x = %i (%x), -x = %i, (%x)\n", x, x, -x, -x);
if(x == 0){
return 0;
}
if(x == 0x80000000){
return 0xcf000000;
}
// If N is negative, negate it in two's complement. Set the high bit of the result
unsigned signBit = 0;
if (x < 0){
signBit = 0x80000000;
x = -x;
}
printf("abs val of x = %i (%x)\n", x, x);
int roundTowardsZero = 0;
int lastDigitLeaving = 0;
int shiftAmount = 0;
int originalAbsX = x;
// If N < 2^23, left shift it (multiply it by 2) until it is great or equal to.
if(x < (8388608)){
while(x < (8388608)){
//printf(" minus shift and x = %i", x );
x = x << 1;
shiftAmount--;
}
} // If N >= 2^24, right shfit it (unsigned divide by 2) until it is less.
else if(x >= (16777215)){
while(x >= (16777215)){
/*if(x & 1){
roundTowardsZero = 1;
printf("zzz Got here ---");
}*/
lastDigitLeaving = (x >> 1) & 1;
//printf(" plus shift and x = %i", x);
x = x >> 1;
shiftAmount++;
}
//Round towards zero
x = (x + (lastDigitLeaving && (!(originalAbsX > 16777216) || signBit)));
printf("x = %i\n", x);
//shiftAmount = shiftAmount + roundTowardsZero;
}
printf("roundTowardsZero = %i, shiftAmount = %i (%x)\n", roundTowardsZero, shiftAmount, shiftAmount);
// Bitwise AND with 0x7fffff
x = x & 0x7fffff;
unsigned exponent = 150 + shiftAmount;
unsigned rightPlaceExponent = exponent << 23;
printf("exponent = %i, rightPlaceExponent = %x\n", exponent, rightPlaceExponent);
unsigned result = signBit | rightPlaceExponent | x;
return result;
The problem is that the lowest int is -2147483648, but the highest is 2147483647, so there is no absolute value of -2147483648. While you could work around it, I would just make a special case for that one bit pattern (like you do for 0):
if (x == 0)
return 0;
if (x == -2147483648)
return 0xcf000000;
The other problem is that you copied an algorithm that only works for numbers from 0 to 32767. Further down in the article they explain how to expand it to all ints, but it uses operations that you're likely not allowed to use.
I would recommend writing it from scratch based on the algorithm mentioned in your edit. Here's a version in C# that rounds towards 0:
uint float_from_int(int x)
{
if (x == 0)
return 0; // 0 is a special case because it has no 1 bits
// Save the sign bit of the input and take the absolute value of the input.
uint signBit = 0;
uint absX = (uint)x;
if (x < 0)
{
signBit = 0x80000000u;
absX = (uint)-x;
}
// Shift the input left until the high order bit is set to form the mantissa.
// Form the floating exponent by subtracting the number of shifts from 158.
uint exponent = 158;
while ((absX & 0x80000000) == 0)
{
exponent--;
absX <<= 1;
}
// compute mantissa
uint mantissa = absX >> 8;
// Assemble the float from the sign, mantissa, and exponent.
return signBit | (exponent << 23) | (mantissa & 0x7fffff);
}
The basic formulation of the algorithm is to determine the sign, exponent and mantissa bits, then pack the result into an integer. Breaking it down this way makes it easy to clearly separate the tasks in code and makes solving the problem (and testing your algorithm) much easier.
The sign bit is the easiest, and getting rid of it makes finding the exponent easier. You can distinguish four cases: 0, 0x80000000, [-0x7ffffff, -1], and [1, 0x7fffffff]. The first two are special cases, and you can trivially get the sign bit in the last two cases (and the absolute value of the input). If you're going to cast to unsigned, you can get away with not special-casing 0x80000000 as I mentioned in a comment.
Next up, find the exponent -- there's an easy (and costly) looping way, and a trickier but faster way to do this. My absolute favourite page for this is Sean Anderson's bit hacks page. One of the algorithms shows a very quick loop-less way to find the log2 of an integer in only seven operations.
Once you know the exponent, then finding the mantissa is easy. You just drop the leading one bit, then shift the result either left or right depending on the exponent's value.
If you use the fast log2 algorithm, you can probably end up with an algorithm which uses no more than 20 operations.
Dealing with 0x80000000 is pretty easy:
int xIsNegative = 0;
unsigned int absValOfX = x;
if (x < 0)
{
xIsNegative = 1;
absValOfX = -(unsigned int)x;
}
It gets rid of special casing -2147483648 since that value is representable as an unsigned value, and absValOfX should always be positive.

What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that is a 1), what is the quickest/most efficient method of finding out?
I know that POSIX supports a ffs() method in <strings.h> to find the first set bit, but there doesn't seem to be a corresponding fls() method.
Is there some really obvious way of doing this that I'm missing?
What about in cases where you can't use POSIX functions for portability?
EDIT: What about a solution that works on both 32- and 64-bit architectures (many of the code listings seem like they'd only work on 32-bit integers).
GCC has:
-- Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in X, starting at the most
significant bit position. If X is 0, the result is undefined.
-- Built-in Function: int __builtin_clzl (unsigned long)
Similar to `__builtin_clz', except the argument type is `unsigned
long'.
-- Built-in Function: int __builtin_clzll (unsigned long long)
Similar to `__builtin_clz', except the argument type is `unsigned
long long'.
I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.
A useful trick if your input can be zero is __builtin_clz(x | 1): unconditionally setting the low bit without modifying any others makes the output 31 for x=0, without changing the output for any other input.
To avoid needing to do that, your other option is platform-specific intrinsics like ARM GCC's __clz (no header needed), or x86's _lzcnt_u32 on CPUs that support the lzcnt instruction. (Beware that lzcnt decodes as bsr on older CPUs instead of faulting, which gives 31-lzcnt for non-zero inputs.)
There's unfortunately no way to portably take advantage of the various CLZ instructions on non-x86 platforms that do define the result for input=0 as 32 or 64 (according to the operand width). x86's lzcnt does that, too, while bsr produces a bit-index that the compiler has to flip unless you use 31-__builtin_clz(x).
(The "undefined result" is not C Undefined Behavior, just a value that isn't defined. It's actually whatever was in the destination register when the instruction ran. AMD documents this, Intel doesn't, but Intel's CPUs do implement that behaviour. But it's not whatever was previously in the C variable you're assigning to, that's not usually how things work when gcc turns C into asm. See also Why does breaking the "output dependency" of LZCNT matter?)
Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
unsigned int v;
unsigned r = 0;
while (v >>= 1) {
r++;
}
This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.
If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.
NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.
Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:
Searches the source operand for the most significant set
bit (1 bit). If a most significant 1
bit is found, its bit index is stored
in the destination operand. The source operand can be a
register or a memory location; the
destination operand is a register. The
bit index is an unsigned offset from
bit 0 of the source operand. If the
content source operand is 0, the
content of the destination operand is
undefined.
(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)
Example code for gcc:
#include <iostream>
int main (int,char**)
{
int n=1;
for (;;++n) {
int msb;
asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
std::cout << n << " : " << msb << std::endl;
}
return 0;
}
See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.
This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.
My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.
double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023; // assumes x86 endianness
This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.
It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.
To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.
// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?
double ff=(double)(v|1);
uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;
Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.
This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).
This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD
This should be lightning fast:
int msb(unsigned int v) {
static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v = (v >> 1) + 1;
return pos[(v * 0x077CB531UL) >> 27];
}
Kaz Kylheku here
I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.
(I happen to need this "find highest bit" for something, you see.)
I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.
The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.
The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.
Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).
int highest_bit_unrolled(long long n)
{
if (n & 0x7FFFFFFF00000000) {
if (n & 0x7FFF000000000000) {
if (n & 0x7F00000000000000) {
if (n & 0x7000000000000000) {
if (n & 0x4000000000000000)
return 63;
else
return (n & 0x2000000000000000) ? 62 : 61;
} else {
if (n & 0x0C00000000000000)
return (n & 0x0800000000000000) ? 60 : 59;
else
return (n & 0x0200000000000000) ? 58 : 57;
}
} else {
if (n & 0x00F0000000000000) {
if (n & 0x00C0000000000000)
return (n & 0x0080000000000000) ? 56 : 55;
else
return (n & 0x0020000000000000) ? 54 : 53;
} else {
if (n & 0x000C000000000000)
return (n & 0x0008000000000000) ? 52 : 51;
else
return (n & 0x0002000000000000) ? 50 : 49;
}
}
} else {
if (n & 0x0000FF0000000000) {
if (n & 0x0000F00000000000) {
if (n & 0x0000C00000000000)
return (n & 0x0000800000000000) ? 48 : 47;
else
return (n & 0x0000200000000000) ? 46 : 45;
} else {
if (n & 0x00000C0000000000)
return (n & 0x0000080000000000) ? 44 : 43;
else
return (n & 0x0000020000000000) ? 42 : 41;
}
} else {
if (n & 0x000000F000000000) {
if (n & 0x000000C000000000)
return (n & 0x0000008000000000) ? 40 : 39;
else
return (n & 0x0000002000000000) ? 38 : 37;
} else {
if (n & 0x0000000C00000000)
return (n & 0x0000000800000000) ? 36 : 35;
else
return (n & 0x0000000200000000) ? 34 : 33;
}
}
}
} else {
if (n & 0x00000000FFFF0000) {
if (n & 0x00000000FF000000) {
if (n & 0x00000000F0000000) {
if (n & 0x00000000C0000000)
return (n & 0x0000000080000000) ? 32 : 31;
else
return (n & 0x0000000020000000) ? 30 : 29;
} else {
if (n & 0x000000000C000000)
return (n & 0x0000000008000000) ? 28 : 27;
else
return (n & 0x0000000002000000) ? 26 : 25;
}
} else {
if (n & 0x0000000000F00000) {
if (n & 0x0000000000C00000)
return (n & 0x0000000000800000) ? 24 : 23;
else
return (n & 0x0000000000200000) ? 22 : 21;
} else {
if (n & 0x00000000000C0000)
return (n & 0x0000000000080000) ? 20 : 19;
else
return (n & 0x0000000000020000) ? 18 : 17;
}
}
} else {
if (n & 0x000000000000FF00) {
if (n & 0x000000000000F000) {
if (n & 0x000000000000C000)
return (n & 0x0000000000008000) ? 16 : 15;
else
return (n & 0x0000000000002000) ? 14 : 13;
} else {
if (n & 0x0000000000000C00)
return (n & 0x0000000000000800) ? 12 : 11;
else
return (n & 0x0000000000000200) ? 10 : 9;
}
} else {
if (n & 0x00000000000000F0) {
if (n & 0x00000000000000C0)
return (n & 0x0000000000000080) ? 8 : 7;
else
return (n & 0x0000000000000020) ? 6 : 5;
} else {
if (n & 0x000000000000000C)
return (n & 0x0000000000000008) ? 4 : 3;
else
return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
}
}
}
}
}
int highest_bit(long long n)
{
const long long mask[] = {
0x000000007FFFFFFF,
0x000000000000FFFF,
0x00000000000000FF,
0x000000000000000F,
0x0000000000000003,
0x0000000000000001
};
int hi = 64;
int lo = 0;
int i = 0;
if (n == 0)
return 0;
for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
int mi = lo + (hi - lo) / 2;
if ((n >> mi) != 0)
lo = mi;
else if ((n & (mask[i] << lo)) != 0)
hi = mi;
}
return lo + 1;
}
Quick and dirty test program:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int highest_bit_unrolled(long long n);
int highest_bit(long long n);
main(int argc, char **argv)
{
long long n = strtoull(argv[1], NULL, 0);
int b1, b2;
long i;
clock_t start = clock(), mid, end;
for (i = 0; i < 1000000000; i++)
b1 = highest_bit_unrolled(n);
mid = clock();
for (i = 0; i < 1000000000; i++)
b2 = highest_bit(n);
end = clock();
printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);
printf("time1 = %d\n", (int) (mid - start));
printf("time2 = %d\n", (int) (end - mid));
return 0;
}
Using only -O2, the difference becomes greater. The decision tree is almost four times faster.
I also benchmarked against the naive bit shifting code:
int highest_bit_shift(long long n)
{
int i = 0;
for (; n; n >>= 1, i++)
; /* empty */
return i;
}
This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!
On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.
I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.
Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.
[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:
asm ("bsrl %1, %0"
: "=r" (position)
: "r" (number));
unsigned int
msb32(register unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return(x & ~(x >> 1));
}
1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.
From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit
What about
int highest_bit(unsigned int a) {
int count;
std::frexp(a, &count);
return count - 1;
}
?
Here are some (simple) benchmarks, of algorithms currently given on this page...
The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)
On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...
//////// go.c ///////////////////////////////
// compile with: gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/***************** math ********************/
#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */ \
((unsigned) log2(a)) /* thus: do not use if a <= 0 */
#define NUM_OF_HIGHESTBITmath(a) ((a) \
? (1U << POS_OF_HIGHESTBITmath(a)) \
: 0)
/***************** clz ********************/
unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */
#define NUM_OF_HIGHESTBITclz(a) ((a) \
? (1U << POS_OF_HIGHESTBITclz(a)) \
: 0)
/***************** i2f ********************/
double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)
#define NUM_OF_HIGHESTBITi2f(a) ((a) \
? (1U << POS_OF_HIGHESTBITi2f(a)) \
: 0)
/***************** asm ********************/
unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)
#define NUM_OF_HIGHESTBITasm(a) ((a) \
? (1U << POS_OF_HIGHESTBITasm(a)) \
: 0)
/***************** bitshift1 ********************/
#define NUM_OF_HIGHESTBITbitshift1(a) (({ \
OUT = a; \
OUT |= (OUT >> 1); \
OUT |= (OUT >> 2); \
OUT |= (OUT >> 4); \
OUT |= (OUT >> 8); \
OUT |= (OUT >> 16); \
}), (OUT & ~(OUT >> 1))) \
/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
#define POS_OF_HIGHESTBITbitshift2(a) (({ \
OUT = a; \
OUT |= OUT >> 1; \
OUT |= OUT >> 2; \
OUT |= OUT >> 4; \
OUT |= OUT >> 8; \
OUT |= OUT >> 16; \
OUT = (OUT >> 1) + 1; \
}), POS[(OUT * 0x077CB531UL) >> 27])
#define NUM_OF_HIGHESTBITbitshift2(a) ((a) \
? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
: 0)
#define LOOPS 100000000U
int main()
{
time_t start, end;
unsigned ui;
unsigned n;
/********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
printf("math\n");
for (ui = 0U; ui < 18; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));
printf("\n\n");
printf("clz\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));
printf("\n\n");
printf("i2f\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));
printf("\n\n");
printf("asm\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
}
printf("\n\n");
printf("bitshift1\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
}
printf("\n\n");
printf("bitshift2\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
}
printf("\n\nPlease wait...\n\n");
/************************* Simple clock() benchmark ******************/
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITmath(ui);
end = clock();
printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITclz(ui);
end = clock();
printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITi2f(ui);
end = clock();
printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITasm(ui);
end = clock();
printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift1(ui);
end = clock();
printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift2(ui);
end = clock();
printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);
printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");
return EXIT_SUCCESS;
}
Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).
If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.
u8 inline log2 (u32 val) {
u8 k = 0;
if (val > 0x0000FFFFu) { val >>= 16; k = 16; }
if (val > 0x000000FFu) { val >>= 8; k |= 8; }
if (val > 0x0000000Fu) { val >>= 4; k |= 4; }
if (val > 0x00000003u) { val >>= 2; k |= 2; }
k |= (val & 2) >> 1;
return k;
}
Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.
Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.
u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};
u8 log2_table(u32 val) {
u8 k = 0;
if (val > 0x0000FFFFuL) { val >>= 16; k = 16; }
if (val > 0x000000FFuL) { val >>= 8; k |= 8; }
k |= kTableLog2[val]; // precompute the Log2 of the low byte
return k;
}
This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.
I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.
int highest_bit(unsigned int a) {
static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
const unsigned int *mask = maskv;
int l, h;
if (a == 0) return -1;
l = 0;
h = 32;
do {
int m = l + (h - l) / 2;
if ((a >> m) != 0) l = m;
else if ((a & (*mask << l)) != 0) h = m;
mask++;
} while (l < h - 1);
return l;
}
A version in C using successive approximation:
unsigned int getMsb(unsigned int n)
{
unsigned int msb = sizeof(n) * 4;
unsigned int step = msb;
while (step > 1)
{
step /=2;
if (n>>msb)
msb += step;
else
msb -= step;
}
if (n>>msb)
msb++;
return (msb - 1);
}
Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same.
( 4 loops when using "unsigned int")
thats some kind of binary search, it works with all kinds of (unsigned!) integer types
#include <climits>
#define UINT (unsigned int)
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int msb(UINT x)
{
if(0 == x)
return -1;
int c = 0;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x >> i))
{
x >>= i;
c |= i;
}
return c;
}
to make complete:
#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))
int lsb(UINT x)
{
if(0 == x)
return -1;
int c = UINT_BIT-1;
for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x << i))
{
x <<= i;
c ^= i;
}
return c;
}
Expanding on Josh's benchmark...
one can improve the clz as follows
/***************** clz2 ********************/
#define NUM_OF_HIGHESTBITclz2(a) ((a) \
? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
: 0)
Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.
As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:
#if defined(__LP64__) || defined(_LP64)
# define BUILD_64 1
#endif
#include <stdio.h>
#include <stdint.h> /* for uint32_t */
/* CHAR_BIT (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT 8
#endif /* CHAR_BIT */
/*
* Find the log base 2 of an integer with the MSB N set in O(N)
* operations. (on 64bit & 32bit architectures)
*/
int
getmsb (uint32_t word)
{
int r = 0;
if (word < 1)
return 0;
#ifdef BUILD_64
union { uint32_t u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
while (word >>= 1)
{
r++;
}
#endif /* BUILD_64 */
return r;
}
I know this question is very old, but just having implemented an msb() function myself,
I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:
Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01.
See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.
It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).
Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).
Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:
template <typename T>
auto msb(T n) -> int
{
static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
"msb<T>(): T must be an unsigned integral type.");
for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
{
if ((n & mask) != 0)
return i;
}
return 0;
}
Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size.
My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.
1The argument goes like this (rough draft):
Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.
If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:
(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n
This series of average iterations is actually convergent and has a limit of 2 for n towards infinity
Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.
c99 has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:
const auto n = 13UL;
const auto Index = (unsigned long)log2(n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
An n of 0UL needs to be guarded against as well, because:
-∞ is returned and FE_DIVBYZERO is raised
I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi
The visual-studio corollary to ephemient's gcc only answer is:
const auto n = 13UL;
unsigned long Index;
_BitScanReverse(&Index, n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
The documentation for _BitScanReverse states that Index is:
Loaded with the bit position of the first set bit (1) found
In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:
0 if no set bits were found
Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409
Think bitwise operators.
I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:
position = sizeof(int)*8
while(!(n & cmp)){
n <<=1;
position--;
}
Woaw, that was many answers. I am not sorry for answering on an old question.
int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
if(0xFFFFFFFF00000000&value){ value>>=(1<<5); result|=(1<<5); }//if it is 32bit then remove this line
if(0x00000000FFFF0000&value){ value>>=(1<<4); result|=(1<<4); }//and remove the 32msb
if(0x000000000000FF00&value){ value>>=(1<<3); result|=(1<<3); }
if(0x00000000000000F0&value){ value>>=(1<<2); result|=(1<<2); }
if(0x000000000000000C&value){ value>>=(1<<1); result|=(1<<1); }
if(0x0000000000000002&value){ result|=(1<<0); }
}else{
result=-1;
}
This answer is pretty similar to another answer... oh well.
Note that what you are trying to do is calculate the integer log2 of an integer,
#include <stdio.h>
#include <stdlib.h>
unsigned int
Log2(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1; int k=0;
for( step = 1; step < bits; ) {
n |= (n >> step);
step *= 2; ++k;
}
//printf("%ld %ld\n",x, (x - (n >> 1)) );
return(x - (n >> 1));
}
Observe that you can attempt to search more than 1 bit at a time.
unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
This approach uses a binary search
unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;
while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%ld)=r%d\n",x,guess);
return(guess);
}
Another binary search method, perhaps more readable,
unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%ld)=r%ld\n",x,res);
return(res);
}
And because you will want to test these,
int main()
{
unsigned long int x = 3;
for( x=2; x<1000000000; x*=2 ) {
//printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
}
return(0);
}
Putting this in since it's 'yet another' approach, seems to be different from others already given.
returns -1 if x==0, otherwise floor( log2(x)) (max result 31)
Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.
This is what I use when I don't want to use __builtin_clz because of portability issues.
To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.
int log2floor( unsigned x ){
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
int r = 0;
unsigned xk = x >> 16;
if( xk != 0 ){
r = 16;
x = xk;
}
// x is 0 .. 0xFFFF
xk = x >> 8;
if( xk != 0){
r += 8;
x = xk;
}
// x is 0 .. 0xFF
xk = x >> 4;
if( xk != 0){
r += 4;
x = xk;
}
// now x is 0..15; x=0 only if originally zero.
return r + wtab[x];
}
Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.
The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:
readonly static byte[] msb_tab_15;
// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
var p = new byte[0x8000];
for (byte n = 0; n < 16; n++)
for (int c = (1 << n) >> 1, i = 0; i < c; i++)
p[c + i] = n;
msb_tab_15 = p;
}
The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).
Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:
ulong (64-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 0x40) - 1; // handles cases v==0 and MSB==63
int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
uint (32-bit) Version
/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
if ((int)v <= 0)
return (int)((v >> 26) & 0x20) - 1; // handles cases v==0 and MSB==31
int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
Various overloads for the above
public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];
This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.
That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.
The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.
1 candidates.HighestOne_Tab16A 622,496
2 candidates.HighestOne_Tab16C 628,234
3 candidates.HighestOne_Tab8A 649,146
4 candidates.HighestOne_Tab8B 656,847
5 candidates.HighestOne_Tab16B 657,147
6 candidates.HighestOne_Tab16D 659,650
7 _highest_one_bit_UNMANAGED.HighestOne_U 702,900
8 de_Bruijn.IndexOfMSB 709,672
9 _old_2.HighestOne_Old2 715,810
10 _test_A.HighestOne8 757,188
11 _old_1.HighestOne_Old1 757,925
12 _test_A.HighestOne5 (unsafe) 760,387
13 _test_B.HighestOne8 (unsafe) 763,904
14 _test_A.HighestOne3 (unsafe) 766,433
15 _test_A.HighestOne1 (unsafe) 767,321
16 _test_A.HighestOne4 (unsafe) 771,702
17 _test_B.HighestOne2 (unsafe) 772,136
18 _test_B.HighestOne1 (unsafe) 772,527
19 _test_B.HighestOne3 (unsafe) 774,140
20 _test_A.HighestOne7 (unsafe) 774,581
21 _test_B.HighestOne7 (unsafe) 775,463
22 _test_A.HighestOne2 (unsafe) 776,865
23 candidates.HighestOne_NoTab 777,698
24 _test_B.HighestOne6 (unsafe) 779,481
25 _test_A.HighestOne6 (unsafe) 781,553
26 _test_B.HighestOne4 (unsafe) 785,504
27 _test_B.HighestOne5 (unsafe) 789,797
28 _test_A.HighestOne0 (unsafe) 809,566
29 _test_B.HighestOne0 (unsafe) 814,990
30 _highest_one_bit.HighestOne 824,345
30 _bitarray_ext.RtlFindMostSignificantBit 894,069
31 candidates.HighestOne_Naive 898,865
Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:
[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);
It's really too bad, because here's the entire actual function:
RtlFindMostSignificantBit:
bsr rdx, rcx
mov eax,0FFFFFFFFh
movzx ecx, dl
cmovne eax,ecx
ret
I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:
public static int HighestOne_Tab8A(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
int j;
j = /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
return j + msb_tab_8[v >> j];
}
The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:
const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
N_bsr64 = 0x03F79D71B4CB0A89;
readonly public static sbyte[]
bsf64 =
{
63, 0, 58, 1, 59, 47, 53, 2, 60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15, 8, 23, 7, 6, 5,
},
bsr64 =
{
0, 47, 1, 56, 48, 27, 2, 60, 57, 49, 41, 37, 28, 16, 3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11, 4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30, 9, 24, 13, 18, 8, 12, 7, 6, 5, 63,
};
public static int IndexOfLSB(ulong v) =>
v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;
public static int IndexOfMSB(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;
v |= v >> 1; v |= v >> 2; v |= v >> 4; // does anybody know a better
v |= v >> 8; v |= v >> 16; v |= v >> 32; // way than these 12 ops?
return bsr64[(v * N_bsr64) >> 58];
}
There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.
I assume your question is for an integer (called v below) and not an unsigned integer.
int v = 612635685; // whatever value you wish
unsigned int get_msb(int v)
{
int r = 31; // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.
while (!(v & 0x80000000) && r--) { // mask of the highest bit
v <<= 1; // multiply integer by 2.
}
return r; // will even return -1 if no bit was set, allowing error catch
}
If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly).
Please let me know if I forgot anything. I haven't tested it but it should work just fine.
This looks big but works really fast compared to loop thank from bluegsmith
int Bit_Find_MSB_Fast(int x2)
{
long x = x2 & 0x0FFFFFFFFl;
long num_even = x & 0xAAAAAAAA;
long num_odds = x & 0x55555555;
if (x == 0) return(0);
if (num_even > num_odds)
{
if ((num_even & 0xFFFF0000) != 0) // top 4
{
if ((num_even & 0xFF000000) != 0)
{
if ((num_even & 0xF0000000) != 0)
{
if ((num_even & 0x80000000) != 0) return(32);
else
return(30);
}
else
{
if ((num_even & 0x08000000) != 0) return(28);
else
return(26);
}
}
else
{
if ((num_even & 0x00F00000) != 0)
{
if ((num_even & 0x00800000) != 0) return(24);
else
return(22);
}
else
{
if ((num_even & 0x00080000) != 0) return(20);
else
return(18);
}
}
}
else
{
if ((num_even & 0x0000FF00) != 0)
{
if ((num_even & 0x0000F000) != 0)
{
if ((num_even & 0x00008000) != 0) return(16);
else
return(14);
}
else
{
if ((num_even & 0x00000800) != 0) return(12);
else
return(10);
}
}
else
{
if ((num_even & 0x000000F0) != 0)
{
if ((num_even & 0x00000080) != 0)return(8);
else
return(6);
}
else
{
if ((num_even & 0x00000008) != 0) return(4);
else
return(2);
}
}
}
}
else
{
if ((num_odds & 0xFFFF0000) != 0) // top 4
{
if ((num_odds & 0xFF000000) != 0)
{
if ((num_odds & 0xF0000000) != 0)
{
if ((num_odds & 0x40000000) != 0) return(31);
else
return(29);
}
else
{
if ((num_odds & 0x04000000) != 0) return(27);
else
return(25);
}
}
else
{
if ((num_odds & 0x00F00000) != 0)
{
if ((num_odds & 0x00400000) != 0) return(23);
else
return(21);
}
else
{
if ((num_odds & 0x00040000) != 0) return(19);
else
return(17);
}
}
}
else
{
if ((num_odds & 0x0000FF00) != 0)
{
if ((num_odds & 0x0000F000) != 0)
{
if ((num_odds & 0x00004000) != 0) return(15);
else
return(13);
}
else
{
if ((num_odds & 0x00000400) != 0) return(11);
else
return(9);
}
}
else
{
if ((num_odds & 0x000000F0) != 0)
{
if ((num_odds & 0x00000040) != 0)return(7);
else
return(5);
}
else
{
if ((num_odds & 0x00000004) != 0) return(3);
else
return(1);
}
}
}
}
}
There's a proposal to add bit manipulation functions in C, specifically leading zeros is helpful to find highest bit set. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2827.htm#design-bit-leading.trailing.zeroes.ones
They are expected to be implemented as built-ins where possible, so sure it is an efficient way.
This is similar to what was recently added to C++ (std::countl_zero, etc).
The code:
// x>=1;
unsigned func(unsigned x) {
double d = x ;
int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
printf( "The left-most non zero bit of %d is bit %d\n", x, p);
}
Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1
My humble method is very simple:
MSB(x) = INT[Log(x) / Log(2)]
Translation: The MSB of x is the integer value of (Log of Base x divided by the Log of Base 2).
This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.
Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.
#include <limits.h>
unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
unsigned long flsl(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
unsigned long long flsll(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
And a little improved version for C++.
#include <climits>
constexpr unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}
constexpr unsigned long fls(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}
constexpr unsigned long long fls(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
The code assumes that value won't be 0. If you want to allow 0, you need to modify it.
Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.
Without direct hardware support, it SEEMED like it should be possible to do better than O(log(w)) for w=64bit. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w>=256bit.
Either way, I gave it a go and the best I could come up with was the following mix of techniques:
uint64_t msb64 (uint64_t n) {
const uint64_t M1 = 0x1111111111111111;
// we need to clear blocks of b=4 bits: log(w/b) >= b
n |= (n>>1); n |= (n>>2);
// reverse prefix scan, compiles to 1 mulx
uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;
// parallel-reduce each block
s |= (s>>1); s |= (s>>2);
// parallel reduce, 1 imul
uint64_t c = (s&M1)*(M1<<4);
// collect last nibble, generate compute count - count%4
c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
c &= (0x0F<<2); // zero the lowest 2 bits
// add the missing bits; this could be better solved with a bit of foresight
// by having the sum already stored
uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb
const uint64_t S = 0x3333333322221100; // last should give -1ul
return c | ((S>>(4*b)) & 0x03);
}
This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue in modern x86-64 architectures.
I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere.
Finding a consistent timing and ranking proved to be way harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution, and other CPU shennanigans, which can sometimes overlap the computation of two or more cycles in a loop.
I ran the tests on an AMD Zen using RDTSC and taking a number of precautions such as running a warm-up, introducing artificial chain dependencies, and so on.
For a 64-bit pseudorandom even distribution the results are:
name
cycles
comment
clz
5.16
builtin intrinsic, fastest
cast
5.18
cast to double, extract exp
ulog2
7.50
reduction + deBrujin
msb64*
11.26
this version
unrolled
19.12
varying performance
obvious
110.49
"obviously" slowest for int64
Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.
My method is around 50% slower than deBrujin, but has the advantage of using no extra memory and having a predictable performance. I might try to further optimize it if I ever have time.

Resources