I want to calculate 2n-1 for a 64bit integer value.
What I currently do is this
for(i=0; i<n; i++) r|=1<<i;
and I wonder if there is more elegant way to do it.
The line is in an inner loop, so I need it to be fast.
I thought of
r=(1ULL<<n)-1;
but it doesn't work for n=64, because << is only defined
for values of n up to 63.
EDIT:
Thanks for all your answers and comments.
Here is a little table with the solutions that I tried and liked best.
Second column is time in seconds of my (completely unscientific) benchmark.
r=N2MINUSONE_LUT[n]; 3.9 lookup table = fastest, answer by aviraldg
r =n?~0ull>>(64 - n):0ull; 5.9 fastest without LUT, comment by Christoph
r=(1ULL<<n)-1; 5.9 Obvious but WRONG!
r =(n==64)?-1:(1ULL<<n)-1; 7.0 Short, clear and quite fast, answer by Gabe
r=((1ULL<<(n/2))<<((n+1)/2))-1; 8.2 Nice, w/o spec. case, answer by drawnonward
r=(1ULL<<n-1)+((1ULL<<n-1)-1); 9.2 Nice, w/o spec. case, answer by David Lively
r=pow(2, n)-1; 99.0 Just for comparison
for(i=0; i<n; i++) r|=1<<i; 123.7 My original solution = lame
I accepted
r =n?~0ull>>(64 - n):0ull;
as answer because it's in my opinion the most elegant solution.
It was Christoph who came up with it at first, but unfortunately he only posted it in a
comment. Jens Gustedt added a really nice rationale, so I accept his answer instead. Because I liked Aviral Dasgupta's lookup table solution it got 50 reputation points via a bounty.
Use a lookup table. (Generated by your present code.) This is ideal, since the number of values is small, and you know the results already.
/* lookup table: n -> 2^n-1 -- do not touch */
const static uint64_t N2MINUSONE_LUT[] = {
0x0,
0x1,
0x3,
0x7,
0xf,
0x1f,
0x3f,
0x7f,
0xff,
0x1ff,
0x3ff,
0x7ff,
0xfff,
0x1fff,
0x3fff,
0x7fff,
0xffff,
0x1ffff,
0x3ffff,
0x7ffff,
0xfffff,
0x1fffff,
0x3fffff,
0x7fffff,
0xffffff,
0x1ffffff,
0x3ffffff,
0x7ffffff,
0xfffffff,
0x1fffffff,
0x3fffffff,
0x7fffffff,
0xffffffff,
0x1ffffffff,
0x3ffffffff,
0x7ffffffff,
0xfffffffff,
0x1fffffffff,
0x3fffffffff,
0x7fffffffff,
0xffffffffff,
0x1ffffffffff,
0x3ffffffffff,
0x7ffffffffff,
0xfffffffffff,
0x1fffffffffff,
0x3fffffffffff,
0x7fffffffffff,
0xffffffffffff,
0x1ffffffffffff,
0x3ffffffffffff,
0x7ffffffffffff,
0xfffffffffffff,
0x1fffffffffffff,
0x3fffffffffffff,
0x7fffffffffffff,
0xffffffffffffff,
0x1ffffffffffffff,
0x3ffffffffffffff,
0x7ffffffffffffff,
0xfffffffffffffff,
0x1fffffffffffffff,
0x3fffffffffffffff,
0x7fffffffffffffff,
0xffffffffffffffff,
};
How about a simple r = (n == 64) ? -1 : (1ULL<<n)-1;?
If you want to get the max value just before overflow with a given number of bits, try
r=(1ULL << n-1)+((1ULL<<n-1)-1);
By splitting the shift into two parts (in this case, two 63 bit shifts, since 2^64=2*2^63), subtracting 1 and then adding the two results together, you should be able to do the calculation without overflowing the 64 bit data type.
if (n > 64 || n < 0)
return undefined...
if (n == 64)
return 0xFFFFFFFFFFFFFFFFULL;
return (1ULL << n) - 1;
I like aviraldg answer best.
Just to get rid of the `ULL' stuff etc in C99 I would do
static inline uint64_t n2minusone(unsigned n) {
return n ? (~(uint64_t)0) >> (64u - n) : 0;
}
To see that this is valid
an uint64_t is guaranteed to have a width of exactly 64 bit
the bit negation of that `zero of type uint64_t' has thus exactly
64 one bits
right shift of an unsigned value is guaranteed to be a logical
shift, so everything is filled with zeros from the left
shift with a value equal or greater to the width is undefined, so
yes you have to do at least one conditional to be sure of your result
an inline function (or alternatively a cast to uint64_t if you
prefer) makes this type safe; an unsigned long long may
well be an 128 bit wide value in the future
a static inline function should be seamlessly
inlined in the caller without any overhead
The only problem is that your expression isn't defined for n=64? Then special-case that one value.
(n == 64 ? 0ULL : (1ULL << n)) - 1ULL
Shifting 1 << 64 in a 64 bit integer yields 0, so no need to compute anything for n > 63; shifting should be enough fast
r = n < 64 ? (1ULL << n) - 1 : 0;
But if you are trying this way to know the max value a N bit unsigned integer can have, you change 0 into the known value treating n == 64 as a special case (and you are not able to give a result for n > 64 on hardware with 64bit integer unless you use a multiprecision/bignumber library).
Another approach with bit tricks
~-(1ULL << (n-1) ) | (1ULL << (n-1))
check if it can be semplified... of course, n>0
EDIT
Tests I've done
__attribute__((regparm(0))) unsigned int calcn(int n)
{
register unsigned int res;
asm(
" cmpl $32, %%eax\n"
" jg mmno\n"
" movl $1, %%ebx\n" // ebx = 1
" subl $1, %%eax\n" // eax = n - 1
" movb %%al, %%cl\n" // because of only possible shll reg mode
" shll %%cl, %%ebx\n" // ebx = ebx << eax
" movl %%ebx, %%eax\n" // eax = ebx
" negl %%ebx\n" // -ebx
" notl %%ebx\n" // ~-ebx
" orl %%ebx, %%eax\n" // ~-ebx | ebx
" jmp mmyes\n"
"mmno:\n"
" xor %%eax, %%eax\n"
"mmyes:\n"
:
"=eax" (res):
"eax" (n):
"ebx", "ecx", "cc"
);
return res;
}
#define BMASK(X) (~-(1ULL << ((X)-1) ) | (1ULL << ((X)-1)))
int main()
{
int n = 32; //...
printf("%08X\n", BMASK(n));
printf("%08X %d %08X\n", calcn(n), n&31, BMASK(n&31));
return 0;
}
Output with n = 32 is -1 and -1, while n = 52 yields "-1" and 0xFFFFF, casually 52&31 = 20 and of course n = 20 gives 0xFFFFF...
EDIT2 now the asm code produces 0 for n > 32 (since I am on a 32 bit machine), but at this point the a ? b : 0 solution with the BMASK is clearer and I doubt the asm solution is too much faster (if speed is a so big concern the table idea could be the faster).
Since you've asked for an elegant way to do it:
const uint64_t MAX_UINT64 = 0xffffffffffffffffULL;
#define N2MINUSONE(n) ((MAX_UINT64>>(64-(n))))
I hate it that (a) n << 64 is undefined and (b) on the popular Intel hardware shifting by word size is a no-op.
You have three ways to go here:
Lookup table. I recommend against this because of the memory traffic, plus you will write a lot of code to maintain the memory traffic.
Conditional branch. Check if n is equal to the word size (8 * sizeof(unsigned long long)), if so, return ~(unsigned long long)0, otherwise shift and subtract as usual.
Try to get clever with arithmetic. For example, in real numbers 2^n = 2^(n-1) + 2^(n-1), and you can exploit this identity to make sure you never use a power equal to the word size. But you had better be very sure that n is never zero, because if it is, this identity cannot be expressed in the integers, and shifting left by -1 is likely to bite you in the ass.
I personally would go with the conditional branch—it is the hardest to screw up, manifestly handles all reasonable cases of n, and with modern hardware the likelihood of a branch misprediction is small. Here's what I do in my real code:
/* What makes things hellish is that C does not define the effects of
a 64-bit shift on a 64-bit value, and the Intel hardware computes
shifts mod 64, so that a 64-bit shift has the same effect as a
0-bit shift. The obvious workaround is to define new shift functions
that can shift by 64 bits. */
static inline uint64_t shl(uint64_t word, unsigned bits) {
assert(bits <= 64);
if (bits == 64)
return 0;
else
return word << bits;
}
I think the issue you are seeing is caused because (1<<n)-1 is evaluated as (1<<(n%64))-1 on some chips. Especially if n is or can be optimized as a constant.
Given that, there are many minor variations you can do. For example:
((1ULL<<(n/2))<<((n+1)/2))-1;
You will have to measure to see if that is faster then special casing 64:
(n<64)?(1ULL<<n)-1:~0ULL;
It is true that in C each bit-shifting operation has to shift by less bits than there are bits in the operand (otherwise, the behavior is undefined). However, nobody prohibits you from doing the shift in two consecutive steps
r = ((1ULL << (n - 1)) << 1) - 1;
I.e. shift by n - 1 bits first and then make an extra 1 bit shift. In this case, of course, you have to handle n == 0 situation in a special way, if that is a valid input in your case.
In any case, it is better than your for cycle. The latter is basically the same idea but taken to the extreme for some reason.
Ub = universe in bits = lg(U):
high(v) = v >> (Ub / 2)
low(v) = v & ((~0) >> (Ub - Ub / 2)) // Deal with overflow and with Ub even or odd
You can exploit integer division inaccuracy and use the modulo of the exponent to ensure you always shift in the range [0, (sizeof(uintmax_t) * CHAR_BIT) - 1] to create a universal pow2i function for integers of the largest supported native word size, however, this can easily be tweaked to support arbitrary word sizes.
I honestly don't get why this isn't just the implementation in hardware for bit shift overflows.
#include <limits.h>
static inline uintmax_t pow2i(uintmax_t exponent) {
#define WORD_BITS ( sizeof(uintmax_t) * CHAR_BIT )
return ((uintmax_t) 1) << (exponent / WORD_BITS) << (exponent % WORD_BITS);
#undef WORD_BITS
}
From there, you can calculate pow2i(n) - 1.
Related
int n_b ( char *addr , int i ) {
char char_in_chain = addr [ i / 8 ] ;
return char_in_chain >> i%8 & 0x1;
}
Like what is that : " i%8 & Ox1" ?
Edit: Note that 0x1 is the hexadecimal notation for 1. Also note that :
0x1 = 0x01 = 0x000001 = 0x0...01
i%8 means i modulo 8, ie the rest in the Euclidean division of i by 8.
& 0x1 is a bitwise AND, it converts the number before to binary form then computes the bitwise operation. (it's already in binary but it's just so you understand)
Example : 0x1101 & 0x1001 = 0x1001
Note that any number & 0x1 is either 0 or one.
Example: 0x11111111 & 0x00000001 is 0x1 and 0x11111110 & 0x00000001 is 0x0
Essentially, it is testing the last bit on the number, which the bit determining parity.
Final edit:
I got the precedence wrong, thanks to the comments for pointing it out. Here is the real precedence.
First, we compute i%8.
The result could be 0, 1, 2, 3, 4, 5, 6, 7.
Then, we shift the char by the result, which is maximum 7. That means the i % 8 th bit is now the least significant bit.
Then, we check if the original i % 8 bit is set (equals one) or not. If it is, return 1. Else, return 0.
This function returns the value of a specific bit in a char array as the integer 0 or 1.
addr is the pointer to the first char.
i is the index to the bit. 8 bits are commonly stored in a char.
First, the char at the correct offset is fetched:
char char_in_chain = addr [ i / 8 ] ;
i / 8 divides i by 8, ignoring the remainder. For example, any value in the range from 24 to 31 gives 3 as the result.
This result is used as the index to the char in the array.
Next and finally, the bit is obtained and returned:
return char_in_chain >> i%8 & 0x1;
Let's just look at the expression char_in_chain >> i%8 & 0x1.
It is confusing, because it does not show which operation is done in what sequence. Therefore, I duplicate it with appropriate parentheses: (char_in_chain >> (i % 8)) & 0x1. The rules (operation precedence) are given by the C standard.
First, the remainder of the division of i by 8 is calculated. This is used to right-shift the obtained char_in_chain. Now the interesting bit is in the least significant bit. Finally, this bit is "masked" with the binary AND operator and the second operand 0x1. BTW, there is no need to mark this constant as hex.
Example:
The array contains the bytes 0x5A, 0x23, and 0x42. The index of the bit to retrieve is 13.
i as given as argument is 13.
i / 8 gives 13 / 8 = 1, remainder ignored.
addr[1] returns 0x23, which is stored in char_in_chain.
i % 8 gives 5 (13 / 8 = 1, remainder 5).
0x23 is binary 0b00100011, and right-shifted by 5 gives 0b00000001.
0b00000001 ANDed with 0b00000001 gives 0b00000001.
The value returned is 1.
Note: If more is not clear, feel free to comment.
What the various operators do is explained by any C book, so I won't address that here. To instead analyse the code step by step...
The function and types used:
int as return type is an indication of the programmer being inexperienced at writing hardware-related code. We should always avoid signed types for such purposes. An experienced programmer would have used an unsigned type, like for example uint8_t. (Or in this specific case maybe even bool, depending on what the data is supposed to represent.)
n_b is a rubbish name, we should obviously never give an identifier such a nondescript name. get_bit or similar would have been a better name.
char* is, again, an indication of the programmer being inexperienced. char is particularly problematic when dealing with raw data, since we can't even know if it is signed or unsigned, it depends on which compiler that is used. Had the raw data contained a value of 0x80 or larger and char was negative, we would have gotten a negative type. And then right shifting a negative value is also problematic, since that behavior too is compiler-specific.
char* is proof of the programmer lacking the fundamental knowledge of const correctness. The function does not modify this parameter so it should have been const qualified. Good code would use const uint8_t* addr.
int i is not really incorrect, the signedness doesn't really matter. But good programming practice would have used an unsigned type or even size_t.
With types unsloppified and corrected, the function might look like this:
#include <stdint.h>
uint8_t get_bit (const uint8_t* addr, size_t i ) {
uint8_t char_in_chain = addr [ i / 8 ] ;
return char_in_chain >> i%8 & 0x1;
}
This is still somewhat problematic, because the average C programmer might not remember the precedence of >> vs % vs & on top of their head. It happens to be % over >> over &, but lets write the code a bit more readable still by making precedence explicit: (char_in_chain >> (i%8)) & 0x1.
Then I would question if the local variable really adds anything to readability. Not really, we might as well write:
uint8_t get_bit (const uint8_t* addr, size_t i ) {
return ((addr[i/8]) >> (i%8)) & 0x1;
}
As for what this code actually does: this happens to be a common design pattern for how to access a specific bit in a raw bit-field.
Any bit-field in C may be accessed as an array of bytes.
Bit number n in that bit-field, will be found at byte n/8.
Inside that byte, the bit will be located at n%8.
Bit masking in C is most readably done as data & (1u << bit). Which can be obfuscated as somewhat equivalent but less readable (data >> bit) & 1u, where the masked bit ends up in the LSB.
For example lets assume we have 64 bits of raw data. Bits are always enumerated from 0 to 63 and bytes (just like any C array) from index 0. We want to access bit 33. Then 33/8 integer division = 4.
So byte[4]. Bit 33 will be found at 33%8 = 1. So we can obtain the value of bit 33 from ordinary bit masking byte[33/8] & (1u << (bit%8)). Or similarly, (byte[33/8] >> (bit%8)) & 1u
An alternative, more readable version of it all:
bool is_bit_set (const uint8_t* data, size_t bit)
{
uint8_t byte = data [bit / 8u];
size_t mask = 1u << (bit % 8u);
return (byte & mask) != 0u;
}
(Strictly speaking we could as well do return byte & mask; since a boolean type is used, but it doesn't hurt to be explicit.)
In C when you do something like this:
char var = 1;
while(1)
{
var = var << 1;
}
In the 8th iteration the "<<" operator will shift out the 1 and var will be 0. I need to perform a shift in order to mantain the bit shifting. In other words I need this:
initial ----- 00000001
1st shift -- 00000010
2nd shift - 00000100
3rd shift - 00001000
4th shift - 00010000
5th shift -- 00100000
6th shift -- 01000000
7th shift - 10000000
8th shift - 00000001 (At the 8th shift the one automatically start again)
Is there something equivalent to "<<" but to achieve this?
This is known as a circular shift, but C doesn't offer this functionality at the language level.
You will either have to implement this yourself, or resort to inline assembler routines, assuming your platform natively has such an instruction.
For example:
var = (var << 1) | (var >> 7);
(This is not well-defined for negative signed types, though, so you'd have to change your example to unsigned char.)
Yes, you can use a circular shift. (Although it isn't a built-in C operation, but it is a CPU instruction on x86 CPUs)
So you want to do a bit rotation, a.k.a. circular shift, then.
#include <limits.h> // Needed for CHAR_BIT
// positive numbits -> right rotate, negative numbits -> left rotate
#define ROTATE(type, var, numbits) ((numbits) >= 0 ? \
(var) >> (numbits) | (var) << (CHAR_BIT * sizeof(type) - (numbits)) : \
(var) << -(numbits) | (var) >> (CHAR_BIT * sizeof(type) + (numbits)))
As sizeof() returns sizes as multiples of the size of char (sizeof(char) == 1), and CHAR_BIT indicates the number of bits in a char (which, while usually 8, won't necessarily be), CHAR_BIT * sizeof(x) will give you the size of x in bits.
This is called a circular shift. There are intel x86 assembly instructions to do this but unless performance is REALLY REALLY A HUGE ISSUE you're better off using something like this:
int i = 0x42;
int by = 13;
int shifted = i << by | i >> ((sizeof(int) * 8) - by);
If you find yourself really needing the performance, you can use inline assembly to use the instructions directly (probably. I've never needed it badly enough to try).
It's also important to note that if you're going to be shifting by more places than the size of your data type, you need additional checks to make sure you're not overshifting. Using by = 48 would probably result in shifted receiving a value of 0, though this behavior may be platform specific (i.e. something to avoid like the plague) because if I recall correctly, some platforms perform this masking automatically and others do not.
I am somewhat curious about creating a macro to generate a bit mask for a device register, up to 64bits. Such that BIT_MASK(31) produces 0xffffffff.
However, several C examples do not work as thought, as I get 0x7fffffff instead. It is as-if the compiler is assuming I want signed output, not unsigned. So I tried 32, and noticed that the value wraps back around to 0. This is because of C standards stating that if the shift value is greater than or equal to the number of bits in the operand to be shifted, then the result is undefined. That makes sense.
But, given the following program, bits2.c:
#include <stdio.h>
#define BIT_MASK(foo) ((unsigned int)(1 << foo) - 1)
int main()
{
unsigned int foo;
char *s = "32";
foo = atoi(s);
printf("%d %.8x\n", foo, BIT_MASK(foo));
foo = 32;
printf("%d %.8x\n", foo, BIT_MASK(foo));
return (0);
}
If I compile with gcc -O2 bits2.c -o bits2, and run it on a Linux/x86_64 machine, I get the following:
32 00000000
32 ffffffff
If I take the same code and compile it on a Linux/MIPS (big-endian) machine, I get this:
32 00000000
32 00000000
On the x86_64 machine, if I use gcc -O0 bits2.c -o bits2, then I get:
32 00000000
32 00000000
If I tweak BIT_MASK to ((unsigned int)(1UL << foo) - 1), then the output is 32 00000000 for both forms, regardless of gcc's optimization level.
So it appears that on x86_64, gcc is optimizing something incorrectly OR the undefined nature of left-shifting 32 bits on a 32-bit number is being determined by the hardware of each platform.
Given all of the above, is it possible to programatically create a C macro that creates a bit mask from either a single bit or a range of bits?
I.e.:
BIT_MASK(6) = 0x40
BIT_FIELD_MASK(8, 12) = 0x1f00
Assume BIT_MASK and BIT_FIELD_MASK operate from a 0-index (0-31). BIT_FIELD_MASK is to create a mask from a bit range, i.e., 8:12.
Here is a version of the macro which will work for arbitrary positive inputs. (Negative inputs still invoke undefined behavior...)
#include <limits.h>
/* A mask with x least-significant bits set, possibly 0 or >=32 */
#define BIT_MASK(x) \
(((x) >= sizeof(unsigned) * CHAR_BIT) ?
(unsigned) -1 : (1U << (x)) - 1)
Of course, this is a somewhat dangerous macro as it evaluates its argument twice. This is a good opportunity to use a static inline if you use GCC or target C99 in general.
static inline unsigned bit_mask(int x)
{
return (x >= sizeof(unsigned) * CHAR_BIT) ?
(unsigned) -1 : (1U << x) - 1;
}
As Mysticial noted, shifting more than 32 bits with a 32-bit integer results in implementation-defined undefined behavior. Here are three different implementations of shifting:
On x86, only examine the low 5 bits of the shift amount, so x << 32 == x.
On PowerPC, only examine the low 6 bits of the shift amount, so x << 32 == 0 but x << 64 == x.
On Cell SPUs, examine all bits, so x << y == 0 for all y >= 32.
However, compilers are free to do whatever they want if you shift a 32-bit operand 32 bits or more, and they are even free to behave inconsistently (or make demons fly out your nose).
Implementing BIT_FIELD_MASK:
This will set bit a through bit b (inclusive), as long as 0 <= a <= 31 and 0 <= b <= 31.
#define BIT_MASK(a, b) (((unsigned) -1 >> (31 - (b))) & ~((1U << (a)) - 1))
Shifting by more than or equal to the size of the integer type is undefined behavior.
So no, it's not a GCC bug.
In this case, the literal 1 is of type int which is 32-bits in both systems that you used. So shifting by 32 will invoke this undefined behavior.
In the first case, the compiler is not able to resolve the shift-amount to 32. So it likely just issues the normal shift-instruction. (which in x86 uses only the bottom 5-bits) So you get:
(unsigned int)(1 << 0) - 1
which is zero.
In the second case, GCC is able to resolve the shift-amount to 32. Since it is undefined behavior, it (apparently) just replaces the entire result with 0:
(unsigned int)(0) - 1
so you get ffffffff.
So this is a case of where GCC is using undefined behavior as an opportunity to optimize.
(Though personally, I'd prefer that it emits a warning instead.)
Related: Why does integer overflow on x86 with GCC cause an infinite loop?
Assuming you have a working mask for n bits, e.g.
// set the first n bits to 1, rest to 0
#define BITMASK1(n) ((1ULL << (n)) - 1ULL)
you can make a range bitmask by shifting again:
// set bits [k+1, n] to 1, rest to 0
#define BITNASK(n, k) ((BITMASK(n) >> k) << k)
The type of the result is unsigned long long int in any case.
As discussed, BITMASK1 is UB unless n is small. The general version requires a conditional and evaluates the argument twice:
#define BITMASK1(n) (((n) < sizeof(1ULL) * CHAR_BIT ? (1ULL << (n)) : 0) - 1ULL)
#define BIT_MASK(foo) ((~ 0ULL) >> (64-foo))
I'm a bit paranoid about this. I think this assumes that unsigned long long is exactly 64 bits. But it's a start and it works up to 64 bits.
Maybe this is correct:
define BIT_MASK(foo) ((~ 0ULL) >> (sizeof(0ULL)*8-foo))
A "traditional" formula (1ul<<n)-1 has different behavior on different compilers/processors for n=8*sizeof(1ul). Most commonly it overflows for n=32. Any added conditionals will evaluate n multiple times. Going 64-bits (1ull<<n)-1 is an option, but problem migrates to n=64.
My go-to formula is:
#define BIT_MASK(n) (~( ((~0ull) << ((n)-1)) << 1 ))
It does not overflow for n=64 and evaluates n only once.
As downside it will compile to 2 LSH instructions if n is a variable. Also n cannot be 0 (result will be compiler/processor-specific), but it is a rare possibility for all uses that I have(*) and can be dealt with by adding a guarding "if" statement only where necessary (and even better an "assert" to check both upper and lower boundaries).
(*) - usually data comes from a file or pipe, and size is in bytes. If size is zero, then there's no data, so code should do nothing anyway.
What about:
#define BIT_MASK(n) (~(((~0ULL) >> (n)) << (n)))
This works on all endianess system, doing -1 to invert all bits doesn't work on big-endian system.
Since you need to avoid shifting by as many bits as there are in the type (whether that's unsigned long or unsigned long long), you have to be more devious in the masking when dealing with the full width of the type. One way is to sneak up on it:
#define BIT_MASK(n) (((n) == CHAR_BIT * sizeof(unsigned long long)) ? \
((((1ULL << (n-1)) - 1) << 1) | 1) : \
((1ULL << (n )) - 1))
For a constant n such as 64, the compiler evaluates the expression and generates only the case that is used. For a runtime variable n, this fails just as badly as before if n is greater than the number of bits in unsigned long long (or is negative), but works OK without overflow for values of n in the range 0..(CHAR_BIT * sizeof(unsigned long long)).
Note that CHAR_BIT is defined in <limits.h>.
#iva2k's answer avoids branching and is correct when the length is 64 bits. Working on that, you can also do this:
#define BIT_MASK(length) ~(((unsigned long long) -2) << length - 1);
gcc would generate exactly the same code anyway, though.
Suppose I have an increasing sequence of unsigned integers C[i]. As they increase, it's likely that they will occupy increasingly many bits. I'm looking for an efficient conditional, based purely on two consecutive elements of the sequence C[i] and C[i+1] (past and future ones are not observable), that will evaluate to true either exactly or approximately once for every time the number of bits required increases.
An obvious (but slow) choice of conditional is:
if (ceil(log(C[i+1])) > ceil(log(C[i]))) ...
and likewise anything that computes the number of leading zero bits using special cpu opcodes (much better but still not great).
I suspect there may be a nice solution involving an expression using just bitwise or and bitwise and on the values C[i+1] and C[i]. Any thoughts?
Suppose your two numbers are x and y. If they have the same high order bit, then x^y is less than both x and y. Otherwise, it is higher than one of the two.
So
v = x^y
if (v > x || v > y) { ...one more bit... }
I think you just need clz(C[i+1]) < clz(C[i]) where clz is a function which returns the number of leading zeroes ("count leading zeroes"). Some CPU families have an instruction for this (which may be available as an instrinsic). If not then you have to roll your own (it typically only takes a few instructions) - see Hacker's Delight.
Given (I believe this comes from Hacker's Delight):
int hibit(unsigned int n) {
n |= (n >> 1);
n |= (n >> 2);
n |= (n >> 4);
n |= (n >> 8);
n |= (n >> 16);
return n - (n >> 1);
}
Your conditional is simply hibit(C[i]) != hibit(C[i+1]).
BSR - Bit Scan Reverse (386+)
Usage: BSR dest,src
Modifies flags: ZF
Scans source operand for first bit set. Sets ZF if a bit is found
set and loads the destination with an index to first set bit. Clears
ZF is no bits are found set. BSF scans forward across bit pattern
(0-n) while BSR scans in reverse (n-0).
Clocks Size
Operands 808x 286 386 486 Bytes
reg,reg - - 10+3n 6-103 3
reg,mem - - 10+3n 7-104 3-7
reg32,reg32 - - 10+3n 6-103 3-7
reg32,mem32 - - 10+3n 7-104 3-7
You need two of these (on C[i] and C[i]+1) and a compare.
Keith Randall's solution is good, but you can save one xor instruction by using the following code which processes the entire sequence in O(w + n) instructions, where w is the number of bits in a word, and n is the number of elements in the sequence. If the sequence is long, most iterations will only involve one comparison, avoiding one xor instruction.
This is accomplished by tracking the highest power of two that has been reached as follows:
t = 1; // original setting
if (c[i + 1] >= t) {
do {
t <<= 1;
} while (c[i + 1] >= t); // watch for overflow
... // conditional code here
}
The number of bits goes up when the value is about overflow a power of two. A simple test is then, is the value equal to a power of two, minus 1? This can be accomplished by asking:
if ((C[i] & (C[i]+1))==0) ...
The number of bits goes up when the value is about to overflow a power of two.
A simple test is then:
while (C[i] >= (1<<number_of_bits)) then number_of_bits++;
If you want it even faster:
int number_of_bits = 1;
int two_to_number_of_bits = 1<<number_of_bits ;
... your code ....
while ( C[i]>=two_to_number_of_bits )
{ number_of_bits++;
two_to_number_of_bits = 1<<number_of_bits ;
}
I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.
The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like
int ShiftByConstant( int x ) { return x << 3 ; }
is fast, single-op, and superscalar, whereas a shift-by-variable, like
int ShiftByVar( int x, int y ) { return x << y ; }
is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stops dead.
What I'd like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won't help with the latency of the sraw itself — it'll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.
I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)
This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.
Edit: A couple of clarifications that I should add:
I'm not even a little bit worried about portability
PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function
int isel(a, b, c) { return a >= 0 ? b : c; }
(if you write out a ternary that does the same thing I'll get what you mean)
integer multiplication is also microcoded and even slower than sraw. :-(
On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.
Here you go...
I decided to try these out as well since Mike Acton claimed it would be faster than using the CELL/PS3 microcoded shift on his CellPerformance site where he suggests to avoid the indirect shift. However, in all my tests, using the microcoded version was not only faster than a full generic branch-free replacement for indirect shift, it takes way less memory for the code (1 instruction).
The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.
template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
{ // 31-bit shift capability (Rolls over at 32-bits)
const int bMask1=-(1&nShift);
const int bMask2=-(1&(nShift>>1));
const int bMask3=-(1&(nShift>>2));
const int bMask4=-(1&(nShift>>3));
const int bMask5=-(1&(nShift>>4));
nVal=(nVal&bMask1) + nVal; //nVal=((nVal<<1)&bMask1) | (nVal&(~bMask1));
nVal=((nVal<<(1<<1))&bMask2) | (nVal&(~bMask2));
nVal=((nVal<<(1<<2))&bMask3) | (nVal&(~bMask3));
nVal=((nVal<<(1<<3))&bMask4) | (nVal&(~bMask4));
nVal=((nVal<<(1<<4))&bMask5) | (nVal&(~bMask5));
return(nVal);
}
template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
{ // 31-bit shift capability (Rolls over at 32-bits)
const int bMask1=-(1&nShift);
const int bMask2=-(1&(nShift>>1));
const int bMask3=-(1&(nShift>>2));
const int bMask4=-(1&(nShift>>3));
const int bMask5=-(1&(nShift>>4));
nVal=((nVal>>1)&bMask1) | (nVal&(~bMask1));
nVal=((nVal>>(1<<1))&bMask2) | (nVal&(~bMask2));
nVal=((nVal>>(1<<2))&bMask3) | (nVal&(~bMask3));
nVal=((nVal>>(1<<3))&bMask4) | (nVal&(~bMask4));
nVal=((nVal>>(1<<4))&bMask5) | (nVal&(~bMask5));
return(nVal);
}
EDIT: Note on isel()
I saw your isel() code on your website.
// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
// mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
return x + ((y - x) & mask);
};
FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:
return (x & (~mask)) + (y & mask);
How about this:
if (y & 16) x <<= 16;
if (y & 8) x <<= 8;
if (y & 4) x <<= 4;
if (y & 2) x <<= 2;
if (y & 1) x <<= 1;
will probably take longer yet to execute but easier to interleave if you have other code to go between.
Let's assume that your max shift is 31. So the shift amount is a 5-bit number. Because shifting is cumulative, we can break this into five constant shifts. The obvious version uses branching, but you ruled that out.
Let N be a number between 1 and 5. You want to shift x by 2N if the bit whose value is 2N is set in y, otherwise keep x intact. Here one way to do it:
#define SHIFT(N) x = isel(((y >> N) & 1) - 1, x << (1 << N), x);
The macro assigns to x either x << 2ᴺ or x, depending on whether the Nth bit is set in y or not.
And then the driver:
SHIFT(1); SHIFT(2); SHIFT(3); SHIFT(4); SHIFT(5)
Note that N is a macro variable and becomes constant.
Don't know though if this is going to be actually faster than the variable shift. If it would be, one wonders why the microcode wouldn't run this instead...
This one breaks my head. I've now discarded a half dozen ideas. All of them exploit the notion that adding a thing to itself shifts left 1, doing the same to the result shifts left 4, and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.
The one place I'd look for real help is Hank Warren's Hacker's Delight (which is the only useful part of this answer).
How about this:
int[] multiplicands = { 1, 2, 4, 8, 16, 32, ... etc ...};
int ShiftByVar( int x, int y )
{
//return x << y;
return x * multiplicands[y];
}
If the shift count can be calculated far in advance then I have two ideas that might work
Using self-modifying code
Just modify the shift amount immediate in the instruction. Alternatively generate code dynamically for the functions with variable shift
Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or function pointer to minimize branch misprediction
// shift by constant functions
typedef int (*shiftFunc)(int); // the shift function
#define SHL(n) int shl##n(int x) { return x << (n); }
SHL(1)
SHL(2)
SHL(3)
...
shiftFunc shiftLeft[] = { shl1, shl2, shl3... };
int arr[MAX]; // all the values that need to be shifted with the same amount
shiftFunc shl = shiftLeft[3]; // when you want to shift by 3
for (int i = 0; i < MAX; i++)
arr[i] = shl(arr[i]);
This method might also be done in combination with self-modifying or run-time code generation to remove the need for a function pointer.
Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
If the range of the values is small, lookup table is another possible solution
#define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7 << (n)
#define S2(x, n) S((x + 0)*8, n), S((x + 1)*8, n), S((x + 2)*8, n), S((x + 3)*8, n), \
S((x + 4)*8, n), S((x + 5)*8, n), S((x + 6)*8, n), S((x + 7)*8, n)
uint8_t shl[256][8] = {
{ S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
{ S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
...
{ S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
}
Now x << n is simply shl[x][n] with x being an uint8_t. The table costs 2KB (8 × 256 B) of memory. However for 16-bit values you'll need a 1MB table (16 × 64 KB), which may still be viable and you can do a 32-bit shift by combining two 16-bit shifts together
There is some good stuff here regarding bit manipulation black magic:
Advanced bit manipulation fu (Christer Ericson's blog)
Don't know if any of it's directly applicable, but if there is a way, likely there are some hints to that way in there somewhere.
Here's something that is trivially unrollable:
int result= value;
int shift_accumulator= value;
for (int i= 0; i<5; ++i)
{
result += shift_accumulator & (-(k & 1)); // replace with isel if appropriate
shift_accumulator += shift_accumulator;
k >>= 1;
}