Suppose I have an increasing sequence of unsigned integers C[i]. As they increase, it's likely that they will occupy increasingly many bits. I'm looking for an efficient conditional, based purely on two consecutive elements of the sequence C[i] and C[i+1] (past and future ones are not observable), that will evaluate to true either exactly or approximately once for every time the number of bits required increases.
An obvious (but slow) choice of conditional is:
if (ceil(log2(C[i+1])) > ceil(log2(C[i]))) ...
and likewise anything that computes the number of leading zero bits using special cpu opcodes (much better but still not great).
I suspect there may be a nice solution involving an expression using just bitwise or and bitwise and on the values C[i+1] and C[i]. Any thoughts?
Suppose your two numbers are x and y. If they have the same high-order bit, then x^y is less than both x and y. Otherwise, it is greater than at least one of the two.
So
unsigned v = x ^ y;
if (v > x || v > y) { /* ...one more bit... */ }
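For example, with x = 5 (101) and y = 6 (110), v = 3, which is less than both, so the width is unchanged; with x = 7 (0111) and y = 8 (1000), v = 15, which is greater than both, so a bit was gained.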
I think you just need clz(C[i+1]) < clz(C[i]), where clz is a function which returns the number of leading zeroes ("count leading zeroes"). Some CPU families have an instruction for this (which may be available as an intrinsic). If not, then you have to roll your own (it typically takes only a few instructions) - see Hacker's Delight.
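For instance, a minimal sketch using GCC/Clang's __builtin_clz (a toolchain assumption; the helper name is mine, and __builtin_clz(0) is undefined, hence the guard):

static int width_increased(unsigned prev, unsigned next)
{
    if (prev == 0)
        return next != 0;  // going from zero to nonzero gains the first bit
    return __builtin_clz(next) < __builtin_clz(prev);  // fewer leading zeros = more bits
}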
Given (I believe this comes from Hacker's Delight):
int hibit(unsigned int n) {
    n |= (n >> 1);        // smear the highest set bit to the right...
    n |= (n >> 2);
    n |= (n >> 4);
    n |= (n >> 8);
    n |= (n >> 16);       // ...until every bit below it is also set
    return n - (n >> 1);  // keep only the highest bit
}
Your conditional is simply hibit(C[i]) != hibit(C[i+1]).
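For example, hibit(7) = 4 while hibit(8) = 8, so the conditional fires exactly when the sequence crosses from three bits to four.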
BSR - Bit Scan Reverse (386+)
Usage: BSR dest,src
Modifies flags: ZF
Scans the source operand for the first set bit. If a bit is found set, clears ZF and loads the destination with the index of the first set bit; sets ZF if no bits are found set. BSF scans forward across the bit pattern (0-n) while BSR scans in reverse (n-0).
                          Clocks                   Size
Operands         808x   286    386      486       Bytes
reg,reg           -      -     10+3n    6-103        3
reg,mem           -      -     10+3n    7-104       3-7
reg32,reg32       -      -     10+3n    6-103       3-7
reg32,mem32       -      -     10+3n    7-104       3-7
You need two of these (on C[i] and C[i+1]) and a compare.
Keith Randall's solution is good, but you can save one xor instruction by using the following code which processes the entire sequence in O(w + n) instructions, where w is the number of bits in a word, and n is the number of elements in the sequence. If the sequence is long, most iterations will only involve one comparison, avoiding one xor instruction.
This is accomplished by tracking the highest power of two that has been reached as follows:
t = 1; // original setting
if (c[i + 1] >= t) {
    do {
        t <<= 1; // watch for overflow
    } while (c[i + 1] >= t);
    ... // conditional code here
}
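For concreteness, a minimal sketch of the whole amortized scan (the array name c and length n are assumptions, and every value is assumed to stay below 1ULL << 63 so that t never overflows):

unsigned long long t = 1;
for (size_t i = 0; i < n; i++) {
    if (c[i] >= t) {
        do {
            t <<= 1;
        } while (c[i] >= t);
        // ... c[i] is the first element that needs the larger bit count ...
    }
}

Since t only ever grows, the inner do-while runs at most w times over the entire scan, which is where the O(w + n) bound comes from.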
The number of bits goes up when the value is about to overflow a power of two. A simple test, then, is: is the value equal to a power of two, minus 1? This can be accomplished by asking:
if ((C[i] & (C[i]+1)) == 0) ...
For example, C[i] = 7 gives 7 & 8 == 0, and the next value, 8, needs an extra bit. (Note the test is also true for C[i] == 0.)
The number of bits goes up when the value is about to overflow a power of two.
A simple test is then:
while (C[i] >= (1 << number_of_bits)) number_of_bits++;
If you want it even faster:
int number_of_bits = 1;
int two_to_number_of_bits = 1 << number_of_bits;
... your code ...
while (C[i] >= two_to_number_of_bits) {
    number_of_bits++;
    two_to_number_of_bits = 1 << number_of_bits;
}
Note: this is not a duplicate of Count the consecutive zero bits (trailing) on the right in parallel: an explanation? The linked question has a different context; it only asks about the purpose of the signed() cast.
I've been trying to find a way to get the number of trailing zeros in a number, and I found a Stanford University bit-twiddling write-up here that gives the following explanation.
unsigned int v; // 32-bit word input to count zero bits on right
unsigned int c = 32; // c will be the number of zero bits on the right
v &= -signed(v);
if (v) c--;
if (v & 0x0000FFFF) c -= 16;
if (v & 0x00FF00FF) c -= 8;
if (v & 0x0F0F0F0F) c -= 4;
if (v & 0x33333333) c -= 2;
if (v & 0x55555555) c -= 1;
Why does this end up working? I understand how hex numbers are represented in binary and how the bitwise operators work, but I can't figure out the intuition behind it. What is the working mechanism?
The code is broken (undefined behavior is present). Here is a fixed version which is also slightly easier to understand (and probably faster):
uint32_t v; // 32-bit word input to count zero bits on right
unsigned c; // c will be the number of zero bits on the right
if (v) {
v &= -v; // keep the rightmost set bit (the one that determines the answer), clear all others
c = 0;
if (v & 0xAAAAAAAAu) c |= 1; // binary 10..1010
if (v & 0xCCCCCCCCu) c |= 2; // binary 1100..11001100
if (v & 0xF0F0F0F0u) c |= 4;
if (v & 0xFF00FF00u) c |= 8;
if (v & 0xFFFF0000u) c |= 16;
}
else c = 32;
Once we know only one bit is set, we determine one bit of the result at a time, by simultaneously testing all bits where the result is odd, then all bits where the result has the 2's-place set, etc.
The original code worked in reverse, starting with all bits of the result set (after the if (v) c--;) and then determining which needed to be zero and clearing them.
Since we are learning one bit of the output at a time, I think it's clearer to build the output using bit operations rather than arithmetic.
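As a worked example, take v = 20 (binary 10100): v &= -v leaves 00100. That bit overlaps 0xCCCCCCCC but none of the other masks, so only c |= 2 fires, and the result c = 2 matches the two trailing zeros of 20.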
This code (from the net) is mostly C, although v &= -signed(v); isn't correct C. The intent is for it to behave as v &= ~v + 1;
First, if v is zero, then it remains zero after the & operation, and all of the if statements are skipped, so you get 32.
Otherwise, the & operation (when corrected) clears all bits to the left of the rightmost 1, so at that point v contains a single 1 bit. Then c is decremented to 31, i.e. all 1 bits within the possible result range.
The if statements then determine its numeric position one bit at a time (one bit of the position number, not of v), clearing the bits that should be 0.
The code first transforms v in such a way that it is entirely zero, except that its rightmost one remains. Then it determines the position of that one.
First let's see how we suppress all ones but the rightmost one.
Assume k is the position of the rightmost one in v, i.e. v = (v_{n-1}, v_{n-2}, ..., v_{k+1}, 1, 0, ..., 0).
-v is the number that, added to v, gives 0 (actually it gives 2^n, but the 2^n bit is discarded if we keep only the n least significant bits).
What must the bits of -v be so that v + (-v) = 0?
Obviously bits k-1..0 of -v must be 0, so that added to the trailing zeros of v they give zero.
Bit k of -v must be 1. Added to the 1 in v_k, it gives a zero and a carry of one into position k+1.
Bit k+1 of -v is added to v_{k+1} and to the carry generated at position k, so it must be the logical complement of v_{k+1}: whatever the value of v_{k+1}, we get 1+0+1 if v_{k+1} = 0 (or 1+1+0 if v_{k+1} = 1), and the result at position k+1 is 0 with a carry into position k+2.
The same holds for bits n-1..k+2: each must be the logical complement of the corresponding bit of v.
Hence, we get the well-known result that to form -v, one must
leave unchanged all trailing zeros of v
leave unchanged the rightmost one of v
complement all the other bits.
If we compute v & -v, we have
v        v_{n-1}   v_{n-2}  ...  v_{k+1}   1   0 ... 0
-v     & ~v_{n-1}  ~v_{n-2} ...  ~v_{k+1}  1   0 ... 0
v & -v   0         0        ...  0         1   0 ... 0
So v & -v keeps only the rightmost one of v.
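For example, with 8-bit v = 01101000 (104): -v = 10011000 (152). The trailing zeros and the rightmost one are unchanged, the upper bits 0110 are complemented to 1001, and v & -v = 00001000 = 8.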
To find the location of that one, look at the code:
if (v) c--;                  // no 1 bit at all? -> 32 trailing zeros.
                             // Otherwise the position is in the range c..0 = 31..0
if (v & 0x0000FFFF) c -= 16; // If the one is in the low half of v, the range of
                             // possible positions becomes 15..0.
                             // Otherwise the range must be 31..16.
                             // Either way, the remaining range is c..c-15
if (v & 0x00FF00FF) c -= 8;  // If the one is in the lower byte of its half
                             // (byte 0 if c=15, byte 2 if c=31), it is in the lower
                             // part of the range, so we subtract 8 from the range
                             // boundaries. Otherwise it is in the upper part.
                             // The range of possible positions is now c..c-7
if (v & 0x0F0F0F0F) c -= 4;  // do the same for the remaining bits
if (v & 0x33333333) c -= 2;
if (v & 0x55555555) c -= 1;
I have a math and programming related question about CRC calculation, to avoid recomputing the full CRC of a block when only a small portion of it changes.
My problem is the following: I have a 1K block of 4-byte structures, each one representing a data field. The full 1K block has a CRC16 at the end, computed over the full 1K. When I have to change only a single 4-byte structure, I would have to recompute the CRC of the full block, but I'm searching for a more efficient solution to this problem. Something where:
1. I take the current CRC16 of the full 1K block
2. I compute something on the old 4-byte block
3. I "subtract" what was obtained at step 2 from the full 1K CRC16
4. I compute something on the new 4-byte block
5. I "add" what was obtained at step 4 to the result of step 3
To summarize, I am thinking about something like this:
CRC(new-full) = [CRC(old-full) - CRC(block-old) + CRC(block-new)]
But I'm missing the math behind it and what to do to obtain this result, ideally as a general formula.
Thanks in advance.
Take your initial 1024-byte block A and your new 1024-byte block B. Exclusive-or them to get block C. Since you only changed four bytes, C will be a bunch of zeros, four bytes which are the exclusive-or of the previous and new four bytes, and a bunch more zeros.
Now compute the CRC-16 of block C, but without any pre- or post-processing. We will call that CRC-16'. (I would need to see the specific CRC-16 you're using to know what that processing is, if anything.) Exclusive-or the CRC-16 of block A with the CRC-16' of block C, and you now have the CRC-16 of block B.
At first glance, this may not seem like much of a gain compared to just computing the CRC of block B. However there are tricks to rapidly computing the CRC of a bunch of zeros. First off, the zeros preceding the four bytes that were changed give a CRC-16' of zero, regardless of how many zeros there are. So you just start computing the CRC-16' with the exclusive-or of the previous and new four bytes.
Now you continue to compute the CRC-16' on the remaining n zeros after the changed bytes. Normally it takes O(n) time to compute a CRC on n bytes. However if you know that they are all zeros (or all some constant value), then it can be computed in O(log n) time. You can see an example of how this is done in zlib's crc32_combine() routine, and apply that to your CRC.
Given your CRC-16/DNP parameters, the zeros() routine below will apply the requested number of zero bytes to the CRC in O(log n) time.
// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial, reflected. For speed, this requires that a not be zero.
uint16_t multmodp(uint16_t a, uint16_t b) {
uint16_t m = (uint16_t)1 << 15;
uint16_t p = 0;
for (;;) {
if (a & m) {
p ^= b;
if ((a & (m - 1)) == 0)
break;
}
m >>= 1;
b = b & 1 ? (b >> 1) ^ 0xa6bc : b >> 1;
}
return p;
}
// Table of x^2^n modulo p(x).
uint16_t const x2n_table[] = {
0x4000, 0x2000, 0x0800, 0x0080, 0xa6bc, 0x55a7, 0xfc4f, 0x1f78,
0xa31f, 0x78c1, 0xbe76, 0xac8f, 0xb26b, 0x3370, 0xb090
};
// Return x^(n*2^k) modulo p(x).
uint16_t x2nmodp(size_t n, unsigned k) {
k %= 15;
uint16_t p = (uint16_t)1 << 15;
for (;;) {
if (n & 1)
p = multmodp(x2n_table[k], p);
n >>= 1;
if (n == 0)
break;
if (++k == 15)
k = 0;
}
return p;
}
// Apply n zero bytes to crc.
uint16_t zeros(uint16_t crc, size_t n) {
return multmodp(x2nmodp(n, 3), crc);
}
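As a usage sketch (not part of the answer above): assume a helper crc16dnp_raw() that runs the raw CRC-16/DNP register with a zero initial value and no final xor (the 0xffff post-conditioning cancels when the two block CRCs are xored together). Both the helper and the function name here are hypothetical:

#include <stddef.h>
#include <stdint.h>

// Assumed helper: raw reflected CRC-16/DNP, zero init, no final xor.
extern uint16_t crc16dnp_raw(uint16_t crc, const unsigned char *buf, size_t len);

uint16_t crc_update4(uint16_t crc_old,            // CRC-16/DNP of the old block
                     const unsigned char old4[4],
                     const unsigned char new4[4],
                     size_t off)                  // byte offset of the field, <= 1020
{
    unsigned char d[4];
    for (int i = 0; i < 4; i++)
        d[i] = old4[i] ^ new4[i];                 // block C is zero except for these bytes
    uint16_t crcp = crc16dnp_raw(0, d, 4);        // CRC-16' of the difference
    crcp = zeros(crcp, 1024 - off - 4);           // run the trailing zeros in O(log n)
    return crc_old ^ crcp;                        // CRC-16/DNP of the new block
}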
CRC actually makes this an easy thing to do.
When looking into this, I'm sure you've started to read that CRCs are calculated with polynomials over GF(2), and probably skipped over that part to the immediately useful information. Well, it sounds like it's probably time for you to go back over that stuff and reread it a few times so you can really understand it.
But anyway...
Because of the way CRCs are calculated, they have a property that, given two blocks A and B, CRC(A xor B) = CRC(A) xor CRC(B)
So the first simplification you can make is that you just need to calculate the CRC of the changed bits. You could actually precalculate the CRCs of each bit in the block, so that when you change a bit you can just xor its CRC into the block's CRC.
CRCs also have the property that CRC(A * B) = CRC(A * CRC(B)), where that * is polynomial multiplication over GF(2). If you stuff the block with zeros at the end, then don't do that for CRC(B).
This lets you get away with a smaller precalculated table. "Polynomial multiplication over GF(2)" is binary convolution, so multiplying by 1000 is the same as shifting by 3 bits. With this rule, you can precalculate the CRC of the offset of each field. Then just multiply (convolve) the changed bits by the offset CRC (calculated without zero stuffing), calculate the CRC of those 8 bytes, and xor them into the block CRC.
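For concreteness, here is a toy sketch of that xor property with a raw shift-register CRC (zero initial value, no pre- or post-processing); 0xa6bc is the reflected CRC-16/DNP polynomial used elsewhere on this page, and the function name is mine:

#include <stddef.h>
#include <stdint.h>

uint16_t crc_raw(uint16_t crc, const unsigned char *buf, size_t len)
{
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)               // one bit at a time, reflected
            crc = crc & 1 ? (crc >> 1) ^ 0xa6bc : crc >> 1;
    }
    return crc;
}

// For equal-length buffers a, b and d[i] = a[i] ^ b[i]:
// crc_raw(0, a, n) ^ crc_raw(0, b, n) == crc_raw(0, d, n)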
The CRC is the remainder of dividing the long integer formed by the input stream by the short integer corresponding to the polynomial, say p.
If you change some bits in the middle, this amounts to perturbing the dividend by n·2^k, where n has the length of the perturbed section and k is the number of bits that follow it.
Hence, you need to compute the perturbation of the remainder, (n·2^k) mod p. You can address this using
(n·2^k) mod p = ((n mod p) · (2^k mod p)) mod p
The first factor is just the CRC16 of n. The other factor can be obtained efficiently in O(log k) operations by the power algorithm based on squarings (the x2nmodp() routine above is essentially this).
A CRC depends on the calculated CRC of all the data before it.
So the only optimization is to logically split the data into N segments and store the computed CRC state after each segment.
Then, when e.g. modifying segment 6 (of 0..9), take the stored CRC state after segment 5 and continue calculating the CRC from segment 6 through segment 9, as sketched below.
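A minimal sketch of that segment-state idea, assuming a hypothetical raw update routine crc_feed() and fixed-size segments (all names are mine):

#include <stddef.h>
#include <stdint.h>

#define N_SEG   10
#define SEG_LEN 128

// Assumed helper: runs the CRC register over one buffer, no final xor.
extern uint16_t crc_feed(uint16_t state, const unsigned char *buf, size_t len);

// state[i] = CRC register after segments 0..i-1. After modifying segment s,
// only segments s..N_SEG-1 need to be reprocessed.
void crc_refresh(uint16_t state[N_SEG + 1],
                 const unsigned char seg[N_SEG][SEG_LEN], size_t s)
{
    for (size_t i = s; i < N_SEG; i++)
        state[i + 1] = crc_feed(state[i], seg[i], SEG_LEN);
}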
Anyway, CRC calculations are very fast. So consider whether it is worth it.
If x is of unsigned int type, is there a difference between these statements:
return (x & 7);
and
return (-x & 7);
I understand negating an unsigned value gives a value of max_int - value. But is there a difference in the return value (i.e. true/false) between the above two statements under any specific boundary conditions, or are they functionally the same?
Test code:
#include <stdio.h>
static unsigned neg7(unsigned x) { return -x & 7; }
static unsigned pos7(unsigned x) { return +x & 7; }
int main(void)
{
for (unsigned i = 0; i < 8; i++)
printf("%u: pos %u; neg %u\n", i, pos7(i), neg7(i));
return 0;
}
Test results:
0: pos 0; neg 0
1: pos 1; neg 7
2: pos 2; neg 6
3: pos 3; neg 5
4: pos 4; neg 4
5: pos 5; neg 3
6: pos 6; neg 2
7: pos 7; neg 1
For the specific case of 4 (and also 0), there isn't a difference; for other values, there is a difference. You can extend the range of the input, but the outputs will produce the same pattern.
If you ask specifically for true/false (i.e. is zero / not zero) and two's complement then there is indeed no difference. (You do however return not just a simple truth value but allow different bit patterns for true. As long as the caller does not distinguish, that is fine.)
Consider how a two's complement negation is formed: invert the bits, then increment. Since you keep only the least significant bits, no carry can come into them from outside during the increment. This is essential, and it is why the trick works only with a mask covering a contiguous range of least significant bits.
Let's look at the two cases:
First, if the three low bits are zero (for a false equivalent). Inverting gives all ones, incrementing turns them to zero again. The fourth and more significant bits might be different, but they don't influence the least significant bits and they don't influence the result since they are masked out. So this stays.
Second, if the three low bits are not all zero (for a true equivalent). The only way this can change into false is when the increment operation leaves them at zero, which can only happen if they were all ones before, which in turn could only happen if they were all zeros before the inversion. That can't be, since that is the first case. Again, the more significant bits don't influence the three low bits and they are masked out. So the result does not change.
But again, this only works when the caller considers only the truth value (all bits zero / not all bits zero) and when the mask allows a range of bits starting from the least significant without a gap.
Firstly, negating an unsigned int value produces UINT_MAX - original_value + 1. (For example, 0 remains 0 under negation). The alternative way to describe negation is full inversion of all bits followed by increment.
It is not clear why you'd even ask this question, since it is obvious that basically the very first example that comes to mind — an unsigned int value 1 — already produces different results in your expression. 1u & 7 is 1, while -1u & 7 is 7. Did you mean something else, by any chance?
If I have a number that I am certain is a power of two, is there a way to get which power of two the number is? I have thought of this idea:
Having a count and shifting the number right by 1 and incrementing the count until the number is 0. Is there another way, though? Without keeping a counter?
Edit:
Here are some examples:
8 -> returns 3
16 -> returns 4
32 -> returns 5
The most elegant method is De Bruijn sequences. Here's a previous answer I gave to a similar question on how to use them to solve the problem:
Bit twiddling: which bit is set?
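For reference, a minimal sketch of the De Bruijn approach for 32-bit values, using the standard multiply-and-lookup table from the Bit Twiddling Hacks page (the function name is mine; v must be a power of two):

#include <stdint.h>

static const int deBruijnPos[32] = {
     0,  1, 28,  2, 29, 14, 24,  3,
    30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7,
    26, 12, 18,  6, 11,  5, 10,  9
};

static int log2_pow2(uint32_t v)
{
    // Multiplying by the De Bruijn constant slides a unique 5-bit window
    // to the top bits for each power of two; the table decodes it.
    return deBruijnPos[(v * 0x077CB531u) >> 27];
}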
An often-more-practical approach is using your cpu's built-in instruction for finding the first/last bit set.
You could use the log function in cmath...
double exponent = log(number)/log(2.0);
...and then cast it to an int afterwards. (Rounding, e.g. with lround(), is safer than truncating, since floating-point error can leave the quotient just below the integer.)
If that number is called x, you can find it by computing log2f(x). The return value is a float.
You will need to include <math.h> in order to use log2f.
That method certainly would work. Another possible way is to eliminate half of the possibilities each time. Say you have an 8-bit number: 00010000.
Bitwise-and your number with a mask where half of the bits are one and the other half zero, say 00001111:
00010000 & 00001111 = 00000000
Now you know the set bit is not in the low four bits. Do this repeatedly, halving the candidate range until you don't get 0:
00010000 & 00110000 = 00010000
Then narrow it down to the single possible bit that is 1, whose position is your power of two.
Use binary search instead of linear:
public void binarySearch() throws Exception {
int num = 0x40000;
int k = 0;
int shift = 16; // half the width of the type (16 for int, etc.)
int a = 0xffff; // lower half should be f's
while (shift != 0) {
if ((num & a) == 0) {
num = num >>> shift;
k += shift;
shift >>= 1;
} else {
shift >>= 1;
}
a = a >>> shift;
}
System.out.println(k);
}
If you're doing a for loop like I am, one method is to raise 2 to the loop counter before the comparison:
for (var i = 1; Math.pow(2, i) <= 1048576; i++) {
// iterates every power of two until 2^20
}
I want to calculate 2^n - 1 for a 64-bit integer value.
What I currently do is this
for (i = 0; i < n; i++) r |= 1ULL << i;
and I wonder if there is more elegant way to do it.
The line is in an inner loop, so I need it to be fast.
I thought of
r=(1ULL<<n)-1;
but it doesn't work for n=64, because << is only defined
for values of n up to 63.
EDIT:
Thanks for all your answers and comments.
Here is a little table with the solutions that I tried and liked best.
Second column is time in seconds of my (completely unscientific) benchmark.
r = N2MINUSONE_LUT[n];                       3.9   lookup table = fastest, answer by aviraldg
r = n ? ~0ull >> (64 - n) : 0ull;            5.9   fastest without LUT, comment by Christoph
r = (1ULL << n) - 1;                         5.9   obvious but WRONG!
r = (n == 64) ? -1 : (1ULL << n) - 1;        7.0   short, clear and quite fast, answer by Gabe
r = ((1ULL << (n/2)) << ((n+1)/2)) - 1;      8.2   nice, w/o special case, answer by drawnonward
r = (1ULL << n-1) + ((1ULL << n-1) - 1);     9.2   nice, w/o special case, answer by David Lively
r = pow(2, n) - 1;                          99.0   just for comparison
for (i = 0; i < n; i++) r |= 1ULL << i;    123.7   my original solution = lame
I accepted
r = n ? ~0ull >> (64 - n) : 0ull;
as the answer because it's in my opinion the most elegant solution.
It was Christoph who came up with it first, but unfortunately he only posted it in a comment. Jens Gustedt added a really nice rationale, so I accepted his answer instead. Because I liked Aviral Dasgupta's lookup table solution, it got 50 reputation points via a bounty.
Use a lookup table. (Generated by your present code.) This is ideal, since the number of values is small, and you know the results already.
/* lookup table: n -> 2^n-1 -- do not touch */
const static uint64_t N2MINUSONE_LUT[] = {
0x0,
0x1,
0x3,
0x7,
0xf,
0x1f,
0x3f,
0x7f,
0xff,
0x1ff,
0x3ff,
0x7ff,
0xfff,
0x1fff,
0x3fff,
0x7fff,
0xffff,
0x1ffff,
0x3ffff,
0x7ffff,
0xfffff,
0x1fffff,
0x3fffff,
0x7fffff,
0xffffff,
0x1ffffff,
0x3ffffff,
0x7ffffff,
0xfffffff,
0x1fffffff,
0x3fffffff,
0x7fffffff,
0xffffffff,
0x1ffffffff,
0x3ffffffff,
0x7ffffffff,
0xfffffffff,
0x1fffffffff,
0x3fffffffff,
0x7fffffffff,
0xffffffffff,
0x1ffffffffff,
0x3ffffffffff,
0x7ffffffffff,
0xfffffffffff,
0x1fffffffffff,
0x3fffffffffff,
0x7fffffffffff,
0xffffffffffff,
0x1ffffffffffff,
0x3ffffffffffff,
0x7ffffffffffff,
0xfffffffffffff,
0x1fffffffffffff,
0x3fffffffffffff,
0x7fffffffffffff,
0xffffffffffffff,
0x1ffffffffffffff,
0x3ffffffffffffff,
0x7ffffffffffffff,
0xfffffffffffffff,
0x1fffffffffffffff,
0x3fffffffffffffff,
0x7fffffffffffffff,
0xffffffffffffffff,
};
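For reference, a hypothetical one-off generator for the table above (n = 64 needs its own branch, which is the whole point of the question):

#include <stdio.h>

int main(void)
{
    for (int n = 0; n <= 64; n++)
        printf("0x%llx,\n", n == 64 ? ~0ULL : (1ULL << n) - 1);
    return 0;
}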
How about a simple r = (n == 64) ? -1 : (1ULL<<n)-1;?
If you want to get the max value just before overflow with a given number of bits, try
r=(1ULL << n-1)+((1ULL<<n-1)-1);
By splitting the shift into two parts (in this case, two 63 bit shifts, since 2^64=2*2^63), subtracting 1 and then adding the two results together, you should be able to do the calculation without overflowing the 64 bit data type.
if (n > 64 || n < 0)
return undefined...
if (n == 64)
return 0xFFFFFFFFFFFFFFFFULL;
return (1ULL << n) - 1;
I like aviraldg's answer best.
Just to get rid of the `ULL' stuff etc. in C99 I would do:
static inline uint64_t n2minusone(unsigned n) {
return n ? (~(uint64_t)0) >> (64u - n) : 0;
}
To see that this is valid:
- a uint64_t is guaranteed to have a width of exactly 64 bits
- the bit negation of that "zero of type uint64_t" thus has exactly 64 one bits
- right shift of an unsigned value is guaranteed to be a logical shift, so everything is filled with zeros from the left
- a shift by an amount equal to or greater than the width is undefined, so yes, you have to do at least one conditional to be sure of your result
- an inline function (or alternatively a cast to uint64_t if you prefer) makes this type safe; an unsigned long long may well be a 128-bit wide value in the future
- a static inline function should be seamlessly inlined in the caller without any overhead
The only problem is that your expression isn't defined for n=64? Then special-case that one value.
(n == 64 ? 0ULL : (1ULL << n)) - 1ULL
In C, shifting a 64-bit integer by 64 bits is undefined (many CPUs simply take the shift count modulo 64), so the n >= 64 case must be handled separately; the shift itself should be fast enough:
r = n < 64 ? (1ULL << n) - 1 : 0;
But if you are using this to get the max value an N-bit unsigned integer can hold, change the 0 into the known value, treating n == 64 as a special case (and you cannot give a result for n > 64 on hardware with 64-bit integers unless you use a multiprecision/bignumber library).
Another approach with bit tricks:
~-(1ULL << (n-1)) | (1ULL << (n-1))
Check whether it can be simplified... of course, n > 0 is required.
EDIT
Tests I've done
__attribute__((regparm(0))) unsigned int calcn(int n)
{
register unsigned int res;
asm(
" cmpl $32, %%eax\n"
" jg mmno\n"
" movl $1, %%ebx\n" // ebx = 1
" subl $1, %%eax\n" // eax = n - 1
" movb %%al, %%cl\n" // because of only possible shll reg mode
" shll %%cl, %%ebx\n" // ebx = ebx << eax
" movl %%ebx, %%eax\n" // eax = ebx
" negl %%ebx\n" // -ebx
" notl %%ebx\n" // ~-ebx
" orl %%ebx, %%eax\n" // ~-ebx | ebx
" jmp mmyes\n"
"mmno:\n"
" xor %%eax, %%eax\n"
"mmyes:\n"
:
"=eax" (res):
"eax" (n):
"ebx", "ecx", "cc"
);
return res;
}
#define BMASK(X) (~-(1ULL << ((X)-1) ) | (1ULL << ((X)-1)))
int main()
{
int n = 32; //...
printf("%08X\n", BMASK(n));
printf("%08X %d %08X\n", calcn(n), n&31, BMASK(n&31));
return 0;
}
Output with n = 32 is -1 and -1, while n = 52 yields "-1" and 0xFFFFF; coincidentally 52 & 31 = 20, and of course n = 20 gives 0xFFFFF...
EDIT2: now the asm code produces 0 for n > 32 (since I am on a 32-bit machine), but at this point the a ? b : 0 solution with the BMASK is clearer, and I doubt the asm solution is much faster (if speed is such a big concern, the table idea could be the fastest).
Since you've asked for an elegant way to do it:
const uint64_t MAX_UINT64 = 0xffffffffffffffffULL;
#define N2MINUSONE(n) (MAX_UINT64 >> (64 - (n)))  /* valid for 1 <= n <= 64 */
I hate it that (a) n << 64 is undefined and (b) on the popular Intel hardware shifting by word size is a no-op.
You have three ways to go here:
Lookup table. I recommend against this because of the memory traffic, plus you will write a lot of code to maintain the memory traffic.
Conditional branch. Check if n is equal to the word size (8 * sizeof(unsigned long long)), if so, return ~(unsigned long long)0, otherwise shift and subtract as usual.
Try to get clever with arithmetic. For example, in real numbers 2^n = 2^(n-1) + 2^(n-1), and you can exploit this identity to make sure you never use a power equal to the word size. But you had better be very sure that n is never zero, because if it is, this identity cannot be expressed in the integers, and shifting left by -1 is likely to bite you in the ass.
I personally would go with the conditional branch—it is the hardest to screw up, manifestly handles all reasonable cases of n, and with modern hardware the likelihood of a branch misprediction is small. Here's what I do in my real code:
/* What makes things hellish is that C does not define the effects of
a 64-bit shift on a 64-bit value, and the Intel hardware computes
shifts mod 64, so that a 64-bit shift has the same effect as a
0-bit shift. The obvious workaround is to define new shift functions
that can shift by 64 bits. */
static inline uint64_t shl(uint64_t word, unsigned bits) {
assert(bits <= 64);
if (bits == 64)
return 0;
else
return word << bits;
}
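With that helper, the mask itself is just shl((uint64_t)1, n) - 1: shl(1, 64) returns 0, and the unsigned subtraction wraps around to all 64 one bits.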
I think the issue you are seeing is caused because (1<<n)-1 is evaluated as (1<<(n%64))-1 on some chips. Especially if n is or can be optimized as a constant.
Given that, there are many minor variations you can do. For example:
((1ULL<<(n/2))<<((n+1)/2))-1;
You will have to measure to see if that is faster than special-casing 64:
(n<64)?(1ULL<<n)-1:~0ULL;
It is true that in C each bit-shift operation has to shift by fewer bits than there are bits in the operand (otherwise, the behavior is undefined). However, nobody prohibits you from doing the shift in two consecutive steps:
r = ((1ULL << (n - 1)) << 1) - 1;
I.e. shift by n - 1 bits first and then make an extra 1 bit shift. In this case, of course, you have to handle n == 0 situation in a special way, if that is a valid input in your case.
In any case, it is better than your for loop. The latter is basically the same idea, but taken to the extreme for some reason.
Ub = universe in bits = lg(U):
high(v) = v >> (Ub / 2)
low(v) = v & ((~0) >> (Ub - Ub / 2)) // Deal with overflow and with Ub even or odd
You can split the shift into two parts and exploit integer division truncation (exponent / 2 plus exponent - exponent / 2 always reconstructs exponent, while each part stays within [0, (sizeof(uintmax_t) * CHAR_BIT) - 1] for any exponent up to the word size) to create a pow2i function for integers of the largest supported native word size; it can easily be tweaked to support arbitrary word sizes.
I honestly don't get why this isn't just the implementation in hardware for bit shift overflows.
#include <stdint.h>
static inline uintmax_t pow2i(uintmax_t exponent) {
    /* Each partial shift stays below the word width for any exponent up to
       the word size itself, so pow2i(64) wraps cleanly to 0 instead of
       hitting undefined behavior. */
    return ((uintmax_t) 1) << (exponent / 2) << (exponent - exponent / 2);
}
From there, you can calculate pow2i(n) - 1; for n == 64 this correctly yields all ones, since pow2i(64) wraps to 0.