How to compute the integer division, 2^64/n? Assuming:
unsigned long is 64-bit
We use a 64-bit CPU
1 < n < 2^64
If we do 18446744073709551616ul / n, we get warning: integer constant is too large for its type at compile time. This is because 2^64 cannot be represented in a 64-bit integer type. Another way is the following:
#define IS_POWER_OF_TWO(x) ((x & (x - 1)) == 0)
unsigned long q = 18446744073709551615ul / n;
if (IS_POWER_OF_TWO(n))
return q + 1;
else
return q;
Is there any faster (CPU cycle) or cleaner (coding) implementation?
I'll use uint64_t here (which needs the <stdint.h> include) so as not to require your assumption about the size of unsigned long.
phuclv's idea of using -n is clever, but can be made much simpler. As unsigned 64-bit integers, we have -n = 2^64 - n, then (-n)/n = 2^64/n - 1, and we can simply add back the 1.
uint64_t divide_two_to_the_64(uint64_t n) {
return (-n)/n + 1;
}
The generated code is just what you would expect (gcc 8.3 on x86-64 via godbolt):
mov rax, rdi
xor edx, edx
neg rax
div rdi
add rax, 1
ret
I've come up with another solution which was inspired by this question. From there we know that
(a_1 + a_2 + a_3 + ... + a_n)/n =
(a_1/n + a_2/n + a_3/n + ... + a_n/n) + (a_1 % n + a_2 % n + a_3 % n + ... + a_n % n)/n
By choosing a_1 = a_2 = a_3 = ... = a_{n-1} = 1 and a_n = 2^64 - n we'll have
(a_1 + a_2 + a_3 + ... + a_n)/n = (1 + 1 + 1 + ... + (2^64 - n))/n = 2^64/n
= [(n - 1)*(1/n) + (2^64 - n)/n] + [(n - 1)*(1 % n) + (2^64 - n) % n]/n
= (2^64 - n)/n + ((n - 1) + (2^64 - n) % n)/n
2^64 - n is the two's complement of n, which is -n; we can also write it as ~0 - n + 1. So the final solution would be
uint64_t twoPow64div(uint64_t n)
{
return (-n)/n + (n + (-n) % n)/n + (n > 1ULL << 63);
}
The last part is to correct the result, because we deal with unsigned integers instead of signed ones like in the other question. I checked both the 32- and 64-bit versions on my PC and the result matches your solution.
On MSVC, however, there's an intrinsic for 128-bit division, so you can use it like this:
uint64_t remainder;
return _udiv128(1, 0, n, &remainder);
which results in the cleanest output
mov edx, 1
xor eax, eax
div rcx
ret 0
Here's the demo
On most x86 compilers (one notable exception is MSVC) long double also has 64 bits of precision, so you can use any of these:
(uint64_t)(powl(2, 64)/n)
(uint64_t)(((long double)~0ULL)/n)
(uint64_t)(18446744073709551616.0L/n)
although the performance would probably be worse. This can also be applied to any implementation where long double has more than 63 bits of significand, like PowerPC with its double-double implementation.
There's a related question about calculating ((UINT_MAX + 1)/x)*x - 1: Integer arithmetic: Add 1 to UINT_MAX and divide by n without overflow, which also has clever solutions. Based on that we have
2^64/n = (2^64 - n + n)/n = (2^64 - n)/n + 1 = (-n)/n + 1
which is essentially just another way to get Nate Eldredge's answer.
Here's some demo for other compilers on godbolt
See also:
Trick to divide a constant (power of two) by an integer
Efficient computation of 2**64 / divisor via fast floating-point reciprocal
We use a 64-bit CPU
Which 64-bit CPU?
In general, if you multiply a number with N bits by another number that has M bits, the result will have up to N+M bits. For integer division it's similar - if a number with N bits is divided by a number with M bits the result will have N-M+1 bits.
Because multiplication is naturally "widening" (the result has more digits than either of the source numbers) and integer division is naturally "narrowing" (the result has fewer digits), some CPUs support "widening multiplication" and "narrowing division".
In other words, some 64-bit CPUs support dividing a 128-bit number by a 64-bit number to get a 64-bit result. For example, on 80x86 it's a single DIV instruction.
Unfortunately, C doesn't support "widening multiplication" or "narrowing division". It only supports "result is same size as source operands".
Ironically (for unsigned 64-bit divisors on 64-bit 80x86) the compiler must use the DIV instruction anyway, and DIV divides a 128-bit number by a 64-bit number. The C language forces you to use a 64-bit numerator, so the generated code zero-extends your 64-bit numerator to 128 bits and divides it by a 64-bit number to get a 64-bit result; and then you write extra code to work around the fact that the language prevented you from using a 128-bit numerator to begin with.
Hopefully you can see how this situation might be considered "less than ideal".
What I'd want is a way to trick the compiler into supporting "narrowing division". For example, maybe by abusing casts and hoping that the optimiser is smart enough, like this:
__uint128_t numerator = (__uint128_t)1 << 64;
if(n > 1) {
return (uint64_t)(numerator/n);
}
I tested this for the latest versions of GCC, CLANG and ICC (using https://godbolt.org/ ) and found that (for 64-bit 80x86) none of the compilers are smart enough to realise that a single DIV instruction is all that is needed (they all generated code that does a call __udivti3, which is an expensive function to get a 128 bit result). The compilers will only use DIV when the (128-bit) numerator is 64 bits (and it will be preceded by an XOR RDX,RDX to set the highest half of the 128-bit numerator to zeros).
In other words, it's likely that the only way to get ideal code (the DIV instruction by itself on 64-bit 80x86) is to resort to inline assembly.
For example, the best code you'll get without inline assembly (from Nate Eldredge's answer) will be:
mov rax, rdi
xor edx, edx
neg rax
div rdi
add rax, 1
ret
...and the best code that's possible is:
mov edx, 1
xor rax, rax
div rdi
ret
Your way is pretty good. It might be better to write it like this:
return 18446744073709551615ul / n + ((n&(n-1)) ? 0:1);
The hope is to make sure the compiler notices that it can do a conditional move instead of a branch.
Compile and disassemble.
I have an unsigned 32 bit integer encoded in the following way:
the first 6 bits define the opcode
next 8 bits define a register
next 18 bits are a two's complement signed integer value.
I am currently decoding this number (uint32_t inst) using:
const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const int32_t value = ((inst >> 17) & 0x01) ? -(131072 - (inst & 0x1FFFF)) : (inst & 0x1FFFF);
I can measure a significant overhead while decoding the value, and I am quite sure it is due to the ternary operator (essentially an if statement) used to test the sign and perform the negation.
Is there a way to perform value decoding in a faster way?
Your expression is more complicated than it needs to be, especially in needlessly involving the ternary operator. The following expression computes the same results for all inputs without involving the ternary operator.* It is a good candidate for a replacement, but as with any optimization problem, it is essential to test:
const int32_t value = (int32_t)(inst & 0x1FFFF) - (int32_t)(inst & 0x20000);
Or this variation on #doynax's suggestion along similar lines might be more optimizer-friendly:
const int32_t value = (int32_t)(inst & 0x3FFFF ^ 0x20000) - (int32_t)0x20000;
In each case, the casts avoid implementation-defined behavior; on many architectures they would be no-ops as far as the machine code is concerned. On those architectures, these expressions involve fewer operations in all cases than does yours, not to mention being unconditional.
Competitive alternatives involving shifting may also optimize well, but all such alternatives necessarily rely on implementation-defined behavior because of integer overflow of a left shift, a negative integer being the left-hand operand of a right shift, and/or converting an out-of-range value to a signed integer type. You will have to determine for yourself whether that constitutes a problem.
* as compiled by GCC 4.4.7 for x86_64. The original expression invokes implementation-defined behavior for some inputs, so on other implementations the two expressions might compute different values for those inputs.
A standard (even though non-portable) practice is a left-shift followed by an arithmetic right-shift:
const int32_t temp = inst << 14; // "shift out" the 14 unneeded bits
const int32_t value = temp >> 14; // shift the number back; sign-extend
This involves a conversion from uint32_t to int32_t and a right-shift of a possibly negative int32_t; both operations are implementation-defined, i.e. unportable (they work on 2's complement systems and are pretty much guaranteed to work on any common architecture). If you want the best performance and are willing to rely on implementation-defined behavior, you can use this code.
As a single expression:
const int32_t value = (int32_t)(inst << 14) >> 14;
Note: the following looks cleaner, will also typically work, but involves undefined behavior (signed integer overflow):
const int32_t value = (int32_t)inst << 14 >> 14;
Don't use it! (even though you probably won't receive any warning or error about it).
For ideal compiler output with no implementation-defined or undefined behaviour, use #doynax's 2's complement decoding expression:
value = (int32_t)((inst & 0x3FFFF) ^ 0x20000) - (int32_t)0x20000;
The casts make sure we're doing signed subtraction, rather than unsigned with wraparound and then assigning that bit-pattern to a signed integer.
This compiles to optimal asm on ARM, where gcc uses sbfx r1, r1, #0, #18 (signed bitfield-extract) to sign-extend bits [17:0] into a full int32_t register. On x86, it uses shl by 14 and sar by 14 (arithmetic shift) to do the same thing. This is a clear sign that gcc recognizes the 2's complement pattern and uses whatever is most optimal on the target machine to sign-extend the bitfield.
There isn't a portable way to make sure bitfields are ordered the way you want them. gcc appears to order bitfields from LSB to MSB for little-endian targets, but MSB to LSB for big-endian targets. You can use an #if to get identical asm output for ARM with/without -mbig-endian, just like the other methods, but there's no guarantee that other compilers work the same.
If gcc/clang didn't see through the xor and sub, it would be worth considering the <<14 / >>14 implementation which hand-holds the compiler towards doing it that way. Or considering the signed/unsigned bitfield approach with an #if.
But since we can get ideal asm from gcc/clang with fully safe and portable code, we should just do that.
See the code on the Godbolt Compiler Explorer, for versions from most of the answers. You can look at asm output for x86, ARM, ARM64, or PowerPC.
// have to put the results somewhere, so the function doesn't optimize away
struct decode {
//unsigned char opcode, r1;
unsigned int opcode, r1;
int32_t value;
};
// in real code you might return the struct by value, but there's less ABI variation when looking at the ASM this way (some would pack the struct into registers)
void decode_two_comp_doynax(struct decode *result, uint32_t inst) {
result->opcode = ((inst >> 26) & 0x3F);
result->r1 = (inst >> 18) & 0xFF;
result->value = ((inst & 0x3FFFF) ^ 0x20000) - 0x20000;
}
# clang 3.7.1 -O3 -march=haswell (enables BMI1 bextr)
mov eax, esi
shr eax, 26 # grab the top 6 bits with a shift
mov dword ptr [rdi], eax
mov eax, 2066 # 0x812: only AMD provides bextr r32, r32, imm; Intel has to set up the constant separately
bextr eax, esi, eax # extract the middle bitfield
mov dword ptr [rdi + 4], eax
shl esi, 14 # <<14
sar esi, 14 # >>14 (arithmetic shift)
mov dword ptr [rdi + 8], esi
ret
You may consider using bit-fields to simplify your code.
typedef struct inst_type {
#ifdef MY_MACHINE_NEEDS_THIS
uint32_t opcode : 6;
uint32_t r1 : 8;
int32_t value : 18;
#else
int32_t value : 18;
uint32_t r1 : 8;
uint32_t opcode : 6;
#endif
} inst_type;
const uint32_t opcode = inst.opcode;
const uint32_t r1 = inst.r1;
const int32_t value = inst.value;
Direct bit manipulation often performs better, but not always. Using John Bollinger's answer as a baseline, the above structure results in one fewer instruction to extract the three values of interest on GCC (but fewer instructions does not necessarily mean faster).
const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const uint32_t negative = ((inst >> 17) & 0x01);
const int32_t value = -(negative * 131072 - (inst & 0x1FFFF));
When negative is 1 this gives -(131072 - (inst & 0x1FFFF)), and for 0 it gives -(0 - (inst & 0x1FFFF)), which is equal to inst & 0x1FFFF.
I'm calling this code in a function where a signed int n is set to zero, but it behaves strangely.
printf("n is %d \n", n);
printf("shift1 %d \n", -1 << (32 + (~0 + 1)));
printf("shift2 %d \n", -1 << (32 + (~n + 1)));
prints
n is 0
shift1 0
shift2 -2
I have no idea why this is happening, since n == 0.
It behaves strangely because the behavior of the << operator applied to a negative integer is undefined. Therefore, any result is a valid result of this code.
Since the behavior is not defined, we cannot reason about it. So we cannot really say why it differs, only that according to the C specification it is allowed to differ.
When I try it, I get zero as the result of both shifts. (Neither of our results are more correct! This is simply to show that the same program invoking undefined behavior can indeed produce different results on different compilers and/or architectures.)
Shifting a negative number under any circumstances, by any amount (even 0), in any direction, is always Undefined Behavior (UB).
The same for shifting a signed value such that the mathematical result cannot be stored in its type.
One cannot reason about undefined behavior.
I can reproduce the difference even after changing -1 to 1 (with Visual Studio, in DEBUG configuration):
printf("n is %d \n", n);
printf("shift1 %d \n", 1 << (32 + (~0 + 1)));
printf("shift2 %d \n", 1 << (32 + (~n + 1)));
For the first line, I'm getting a warning:
warning C4293: '<<' : shift count negative or too big, undefined behavior
The point is that in the first case, my compiler evaluates the expression at compile time and sees that the shift count is too big, while in the second case the expression is evaluated at run time.
In my case, the expression evaluates to 0 at compile time and to 1 at run time:
int n = 0;
mov dword ptr [n],0
auto x = 1 << (32 + (~0 + 1));
mov dword ptr [x],0
auto y = 1 << (32 + (~n + 1));
mov ecx,dword ptr [n]
not ecx
add ecx,21h
mov eax,1
shl eax,cl
mov dword ptr [y],eax
In RELEASE mode, both results are the same.
I want to calculate 2^n - 1 for a 64-bit integer value.
What I currently do is this
for(i=0; i<n; i++) r|=1<<i;
and I wonder if there is more elegant way to do it.
The line is in an inner loop, so I need it to be fast.
I thought of
r=(1ULL<<n)-1;
but it doesn't work for n=64, because << is only defined
for values of n up to 63.
EDIT:
Thanks for all your answers and comments.
Here is a little table with the solutions that I tried and liked best.
Second column is time in seconds of my (completely unscientific) benchmark.
r=N2MINUSONE_LUT[n]; 3.9 lookup table = fastest, answer by aviraldg
r =n?~0ull>>(64 - n):0ull; 5.9 fastest without LUT, comment by Christoph
r=(1ULL<<n)-1; 5.9 Obvious but WRONG!
r =(n==64)?-1:(1ULL<<n)-1; 7.0 Short, clear and quite fast, answer by Gabe
r=((1ULL<<(n/2))<<((n+1)/2))-1; 8.2 Nice, w/o spec. case, answer by drawnonward
r=(1ULL<<n-1)+((1ULL<<n-1)-1); 9.2 Nice, w/o spec. case, answer by David Lively
r=pow(2, n)-1; 99.0 Just for comparison
for(i=0; i<n; i++) r|=1<<i; 123.7 My original solution = lame
I accepted
r =n?~0ull>>(64 - n):0ull;
as answer because it's in my opinion the most elegant solution.
It was Christoph who came up with it at first, but unfortunately he only posted it in a
comment. Jens Gustedt added a really nice rationale, so I accept his answer instead. Because I liked Aviral Dasgupta's lookup table solution it got 50 reputation points via a bounty.
Use a lookup table. (Generated by your present code.) This is ideal, since the number of values is small, and you know the results already.
/* lookup table: n -> 2^n-1 -- do not touch */
const static uint64_t N2MINUSONE_LUT[] = {
0x0,
0x1,
0x3,
0x7,
0xf,
0x1f,
0x3f,
0x7f,
0xff,
0x1ff,
0x3ff,
0x7ff,
0xfff,
0x1fff,
0x3fff,
0x7fff,
0xffff,
0x1ffff,
0x3ffff,
0x7ffff,
0xfffff,
0x1fffff,
0x3fffff,
0x7fffff,
0xffffff,
0x1ffffff,
0x3ffffff,
0x7ffffff,
0xfffffff,
0x1fffffff,
0x3fffffff,
0x7fffffff,
0xffffffff,
0x1ffffffff,
0x3ffffffff,
0x7ffffffff,
0xfffffffff,
0x1fffffffff,
0x3fffffffff,
0x7fffffffff,
0xffffffffff,
0x1ffffffffff,
0x3ffffffffff,
0x7ffffffffff,
0xfffffffffff,
0x1fffffffffff,
0x3fffffffffff,
0x7fffffffffff,
0xffffffffffff,
0x1ffffffffffff,
0x3ffffffffffff,
0x7ffffffffffff,
0xfffffffffffff,
0x1fffffffffffff,
0x3fffffffffffff,
0x7fffffffffffff,
0xffffffffffffff,
0x1ffffffffffffff,
0x3ffffffffffffff,
0x7ffffffffffffff,
0xfffffffffffffff,
0x1fffffffffffffff,
0x3fffffffffffffff,
0x7fffffffffffffff,
0xffffffffffffffff,
};
How about a simple r = (n == 64) ? -1 : (1ULL<<n)-1;?
If you want to get the max value just before overflow with a given number of bits, try
r=(1ULL << n-1)+((1ULL<<n-1)-1);
By splitting the shift into two parts (in this case, two 63 bit shifts, since 2^64=2*2^63), subtracting 1 and then adding the two results together, you should be able to do the calculation without overflowing the 64 bit data type.
if (n > 64 || n < 0)
return undefined...
if (n == 64)
return 0xFFFFFFFFFFFFFFFFULL;
return (1ULL << n) - 1;
I like aviraldg answer best.
Just to get rid of the `ULL' stuff etc in C99 I would do
static inline uint64_t n2minusone(unsigned n) {
return n ? (~(uint64_t)0) >> (64u - n) : 0;
}
To see that this is valid:
a uint64_t is guaranteed to have a width of exactly 64 bits
the bit negation of that `zero of type uint64_t' thus has exactly 64 one bits
right shift of an unsigned value is guaranteed to be a logical shift, so everything is filled with zeros from the left
shift by a value equal to or greater than the width is undefined, so yes, you have to do at least one conditional to be sure of your result
an inline function (or alternatively a cast to uint64_t if you prefer) makes this type safe; an unsigned long long may well be a 128-bit wide value in the future
a static inline function should be seamlessly inlined in the caller without any overhead
The only problem is that your expression isn't defined for n=64? Then special-case that one value.
(n == 64 ? 0ULL : (1ULL << n)) - 1ULL
Shifting by 64 in a 64-bit integer is undefined, so handle n > 63 separately; the shift itself should be fast enough:
r = n < 64 ? (1ULL << n) - 1 : 0;
But if you are trying this way to find the max value an N-bit unsigned integer can have, you can change the 0 into the known value, treating n == 64 as a special case (and you cannot give a result for n > 64 on hardware with 64-bit integers unless you use a multiprecision/bignumber library).
Another approach with bit tricks
~-(1ULL << (n-1) ) | (1ULL << (n-1))
check if it can be simplified... of course, n > 0 is required
EDIT
Tests I've done
__attribute__((regparm(0))) unsigned int calcn(int n)
{
register unsigned int res;
asm(
" cmpl $32, %%eax\n"
" jg mmno\n"
" movl $1, %%ebx\n" // ebx = 1
" subl $1, %%eax\n" // eax = n - 1
" movb %%al, %%cl\n" // because of only possible shll reg mode
" shll %%cl, %%ebx\n" // ebx = ebx << eax
" movl %%ebx, %%eax\n" // eax = ebx
" negl %%ebx\n" // -ebx
" notl %%ebx\n" // ~-ebx
" orl %%ebx, %%eax\n" // ~-ebx | ebx
" jmp mmyes\n"
"mmno:\n"
" xor %%eax, %%eax\n"
"mmyes:\n"
:
"=a" (res) :
"a" (n) :
"ebx", "ecx", "cc"
);
return res;
}
#define BMASK(X) (~-(1ULL << ((X)-1) ) | (1ULL << ((X)-1)))
int main()
{
int n = 32; //...
printf("%08X\n", BMASK(n));
printf("%08X %d %08X\n", calcn(n), n&31, BMASK(n&31));
return 0;
}
Output with n = 32 is -1 and -1, while n = 52 yields -1 and 0xFFFFF; as it happens, 52 & 31 = 20, and of course n = 20 gives 0xFFFFF...
EDIT2: now the asm code produces 0 for n > 32 (since I am on a 32-bit machine), but at this point the a ? b : 0 solution with BMASK is clearer, and I doubt the asm solution is much faster (if speed is such a big concern, the table idea could be the fastest).
Since you've asked for an elegant way to do it:
const uint64_t MAX_UINT64 = 0xffffffffffffffffULL;
#define N2MINUSONE(n) ((MAX_UINT64>>(64-(n))))
I hate it that (a) n << 64 is undefined and (b) on the popular Intel hardware shifting by word size is a no-op.
You have three ways to go here:
Lookup table. I recommend against this because of the memory traffic, plus you will write a lot of code to maintain the memory traffic.
Conditional branch. Check if n is equal to the word size (8 * sizeof(unsigned long long)), if so, return ~(unsigned long long)0, otherwise shift and subtract as usual.
Try to get clever with arithmetic. For example, in real numbers 2^n = 2^(n-1) + 2^(n-1), and you can exploit this identity to make sure you never use a power equal to the word size. But you had better be very sure that n is never zero, because if it is, this identity cannot be expressed in the integers, and shifting left by -1 is likely to bite you in the ass.
I personally would go with the conditional branch—it is the hardest to screw up, manifestly handles all reasonable cases of n, and with modern hardware the likelihood of a branch misprediction is small. Here's what I do in my real code:
/* What makes things hellish is that C does not define the effects of
a 64-bit shift on a 64-bit value, and the Intel hardware computes
shifts mod 64, so that a 64-bit shift has the same effect as a
0-bit shift. The obvious workaround is to define new shift functions
that can shift by 64 bits. */
static inline uint64_t shl(uint64_t word, unsigned bits) {
assert(bits <= 64);
if (bits == 64)
return 0;
else
return word << bits;
}
I think the issue you are seeing is caused because (1<<n)-1 is evaluated as (1<<(n%64))-1 on some chips. Especially if n is or can be optimized as a constant.
Given that, there are many minor variations you can do. For example:
((1ULL<<(n/2))<<((n+1)/2))-1;
You will have to measure to see if that is faster then special casing 64:
(n<64)?(1ULL<<n)-1:~0ULL;
It is true that in C each bit-shifting operation has to shift by less bits than there are bits in the operand (otherwise, the behavior is undefined). However, nobody prohibits you from doing the shift in two consecutive steps
r = ((1ULL << (n - 1)) << 1) - 1;
I.e. shift by n - 1 bits first and then make an extra 1 bit shift. In this case, of course, you have to handle n == 0 situation in a special way, if that is a valid input in your case.
In any case, it is better than your for cycle. The latter is basically the same idea but taken to the extreme for some reason.
Ub = universe in bits = lg(U):
high(v) = v >> (Ub / 2)
low(v) = v & ((~0) >> (Ub - Ub / 2)) // Deal with overflow and with Ub even or odd
You can exploit the truncation of integer division and take the exponent modulo the word size to ensure you always shift in the range [0, (sizeof(uintmax_t) * CHAR_BIT) - 1]. This creates a universal pow2i function for integers of the largest supported native word size, and it can easily be tweaked to support arbitrary word sizes.
I honestly don't get why this isn't just the implementation in hardware for bit shift overflows.
#include <limits.h>
static inline uintmax_t pow2i(uintmax_t exponent) {
#define WORD_BITS ( sizeof(uintmax_t) * CHAR_BIT )
return ((uintmax_t) 1) << (exponent / WORD_BITS) << (exponent % WORD_BITS);
#undef WORD_BITS
}
From there, you can calculate pow2i(n) - 1.