Emulating variable bit-shift using only constant shifts?

I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.
The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like
int ShiftByConstant( int x ) { return x << 3 ; }
is fast, single-op, and superscalar, whereas a shift-by-variable, like
int ShiftByVar( int x, int y ) { return x << y ; }
is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stops dead.
What I'd like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won't help with the latency of the sraw itself — it'll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.
I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)
This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.
Edit: A couple of clarifications that I should add:
I'm not even a little bit worried about portability
PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function
int isel(int a, int b, int c) { return a >= 0 ? b : c; }
(if you write out a ternary that does the same thing I'll get what you mean)
integer multiplication is also microcoded and even slower than sraw. :-(
On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.

Here you go...
I decided to try these out as well, since Mike Acton claimed it would be faster than using the CELL/PS3 microcoded shift on his CellPerformance site, where he suggests avoiding the indirect shift. However, in all my tests, using the microcoded version was not only faster than a full generic branch-free replacement for the indirect shift, it also takes far less memory for the code (1 instruction).
The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.
template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
{   // 31-bit shift capability (rolls over at 32 bits)
    const int bMask1 = -(1 & nShift);
    const int bMask2 = -(1 & (nShift >> 1));
    const int bMask3 = -(1 & (nShift >> 2));
    const int bMask4 = -(1 & (nShift >> 3));
    const int bMask5 = -(1 & (nShift >> 4));

    nVal = (nVal & bMask1) + nVal; // nVal = ((nVal << 1) & bMask1) | (nVal & (~bMask1));
    nVal = ((nVal << (1 << 1)) & bMask2) | (nVal & (~bMask2));
    nVal = ((nVal << (1 << 2)) & bMask3) | (nVal & (~bMask3));
    nVal = ((nVal << (1 << 3)) & bMask4) | (nVal & (~bMask4));
    nVal = ((nVal << (1 << 4)) & bMask5) | (nVal & (~bMask5));
    return nVal;
}
template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
{   // 31-bit shift capability (rolls over at 32 bits)
    const int bMask1 = -(1 & nShift);
    const int bMask2 = -(1 & (nShift >> 1));
    const int bMask3 = -(1 & (nShift >> 2));
    const int bMask4 = -(1 & (nShift >> 3));
    const int bMask5 = -(1 & (nShift >> 4));

    nVal = ((nVal >> 1) & bMask1) | (nVal & (~bMask1));
    nVal = ((nVal >> (1 << 1)) & bMask2) | (nVal & (~bMask2));
    nVal = ((nVal >> (1 << 2)) & bMask3) | (nVal & (~bMask3));
    nVal = ((nVal >> (1 << 3)) & bMask4) | (nVal & (~bMask4));
    nVal = ((nVal >> (1 << 4)) & bMask5) | (nVal & (~bMask5));
    return nVal;
}
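For reference, a hypothetical call site (FORCEINLINE is assumed to be defined elsewhere in the project, e.g. as __attribute__((always_inline)) inline or __forceinline):

unsigned a = VariableShiftLeft<unsigned>(1u, 5);  // 32
int b = VariableShiftRight<int>(-64, 3);          // -8 (arithmetic shift for signed T)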
EDIT: Note on isel()
I saw your isel() code on your website.
// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
    int mask = a >> 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
    return x + ((y - x) & mask);
}
FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:
return (x & (~mask)) + (y & mask);
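For completeness, the whole function with that change folded in (just a restatement of the code above):

int isel( int a, int x, int y )
{
    int mask = a >> 31; // all ones if a < 0, all zeros otherwise
    return (x & (~mask)) + (y & mask); // the ~mask operand lets the compiler emit andc
}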

How about this:
if (y & 16) x <<= 16;
if (y & 8) x <<= 8;
if (y & 4) x <<= 4;
if (y & 2) x <<= 2;
if (y & 1) x <<= 1;
This will probably take longer to execute, but it is easier to interleave if you have other code to go between.

Let's assume that your max shift is 31. So the shift amount is a 5-bit number. Because shifting is cumulative, we can break this into five constant shifts. The obvious version uses branching, but you ruled that out.
Let N be a number between 0 and 4. You want to shift x by 2^N if the bit whose value is 2^N is set in y, otherwise keep x intact. Here is one way to do it:
#define SHIFT(N) x = isel(((y >> N) & 1) - 1, x << (1 << N), x);
The macro assigns to x either x << 2^N or x, depending on whether bit N is set in y or not.
And then the driver:
SHIFT(0); SHIFT(1); SHIFT(2); SHIFT(3); SHIFT(4);
Note that N is a macro argument and becomes a constant at expansion time; SHIFT(4), for instance, expands to x = isel(((y >> 4) & 1) - 1, x << 16, x);.
I don't know whether this is actually going to be faster than the variable shift. If it is, one wonders why the microcode wouldn't run this instead...

This one breaks my head. I've now discarded a half dozen ideas. All of them exploit the notion that adding a thing to itself shifts left 1, doing the same to the result shifts left 2, and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.
The one place I'd look for real help is Henry Warren's Hacker's Delight (which is the only useful part of this answer).

How about this:
static const int multiplicands[] = { 1, 2, 4, 8, 16, 32, /* ... etc ... */ };

int ShiftByVar( int x, int y )
{
    //return x << y;
    return x * multiplicands[y];
}

If the shift count can be calculated far in advance, then I have two ideas that might work:
Use self-modifying code: just modify the shift-amount immediate in the instruction, or generate code dynamically for the functions with variable shift.
Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or a function pointer to minimize branch misprediction:
// shift by constant functions
typedef int (*shiftFunc)(int); // the shift function

#define SHL(n) int shl##n(int x) { return x << (n); }
SHL(1)
SHL(2)
SHL(3)
...

shiftFunc shiftLeft[] = { shl1, shl2, shl3... };

int arr[MAX]; // all the values that need to be shifted by the same amount
shiftFunc shl = shiftLeft[2]; // when you want to shift by 3 (shl1 sits at index 0)
for (int i = 0; i < MAX; i++)
    arr[i] = shl(arr[i]);
This method might also be done in combination with self-modifying or run-time code generation to remove the need for a function pointer.
Edit: As commented, unfortunately there's no branch prediction on jump-to-register at all, so the only way this could work is by generating code as I said above, or by using SIMD.
If the range of the values is small, a lookup table is another possible solution:
#define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
        ((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7) << (n)
#define S2(x, n) S((x + 0)*8, n), S((x + 1)*8, n), S((x + 2)*8, n), S((x + 3)*8, n), \
        S((x + 4)*8, n), S((x + 5)*8, n), S((x + 6)*8, n), S((x + 7)*8, n)
uint8_t shl[8][256] = {
    { S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
    { S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
    ...
    { S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
};
Now x << n is simply shl[n][x] with x being a uint8_t. The table costs 2KB (8 × 256 B) of memory. However, for 16-bit values you'll need a 2MB table (16 × 128 KB), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts together.

There is some good stuff here regarding bit manipulation black magic:
Advanced bit manipulation fu (Christer Ericson's blog)
Don't know if any of it's directly applicable, but if there is a way, likely there are some hints to that way in there somewhere.

Here's something that is trivially unrollable; once the loop is unrolled, every shift amount is a compile-time constant:
int result = value;
for (int i = 0; i < 5; ++i)
{
    int mask = -(k & 1); // all ones if the current bit of the shift count k is set
    result = (result & ~mask) | ((result << (1 << i)) & mask); // replace with isel if appropriate
    k >>= 1;
}

Related

Efficient modulo-255 computation

I am trying to find the most efficient way to compute modulo 255 of an 32-bit unsigned integer. My primary focus is to find an algorithm that works well across x86 and ARM platforms with an eye towards applicability beyond that. To first order, I am trying to avoid memory operations (which could be expensive), so I am looking for bit-twiddly approaches while avoiding tables. I am also trying to avoid potentially expensive operations such as branches and multiplies, and minimize the number of operations and registers used.
The ISO-C99 code below captures the eight variants I have tried so far. It includes a framework for exhaustive testing. I bolted onto this some crude execution-time measurement which seems to work well enough to get a first performance impression. On the few platforms I tried (all with fast integer multiplies) the variants WARREN_MUL_SHR_2, WARREN_MUL_SHR_1, and DIGIT_SUM_CARRY_OUT_1 seem to be the most performant. My experiments show that the x86, ARM, PowerPC and MIPS compilers I tried at Compiler Explorer all make very good use of platform-specific features such as three-input LEA, byte-expansion instructions, multiply-accumulate, and instruction predication.
The variant NAIVE_USING_DIV uses an integer division and a back-multiply with the divisor, followed by subtraction. This is the baseline case. Modern compilers know how to efficiently implement unsigned integer division by 255 (via multiplication) and will use a discrete replacement for the back-multiply where appropriate. To compute modulo base−1, one can sum the base digits, then fold the result. For example, 3334 mod 9: sum 3+3+3+4 = 13, fold 1+3 = 4. If the result after folding is base−1, we need to generate 0 instead. DIGIT_SUM_THEN_FOLD uses this method.
A. Cockburn, "Efficient implementation of the OSI transport protocol checksum algorithm using 8/16-bit arithmetic", ACM SIGCOMM Computer Communication Review, Vol. 17, No. 3, July/Aug. 1987, pp. 13-20
showed a different way of adding digits modulo base−1 efficiently in the context of a checksum computation modulo 255. Compute a byte-wise sum of the digits, and after each addition, add any carry-out from the addition as well. So this would be an ADD a, b; ADC a, 0 sequence. Writing out the addition chain for this using base-256 digits, it becomes clear that the computation is basically a multiply with 0x0101 ... 0101. The result will be in the most significant digit position, except that one needs to capture the carry-out from the addition in that position separately. This method only works when a base digit comprises 2^k bits. Here we have k = 3. I tried three different ways of remapping a result of base−1 to 0, resulting in variants DIGIT_SUM_CARRY_OUT_1, DIGIT_SUM_CARRY_OUT_2, DIGIT_SUM_CARRY_OUT_3.
An intriguing approach to computing modulo-63 efficiently was demonstrated by Joe Keane in the newsgroup comp.lang.c on 1995/07/09. While thread participant Peter L. Montgomery proved the algorithm correct, unfortunately Mr. Keane did not respond to requests to explain its derivation. This algorithm is also reproduced in H. Warren's Hacker's Delight 2nd ed. I was able to extend it, in purely mechanical fashion, to modulo-127 and modulo-255. This is the (appropriately named) KEANE_MAGIC variant. Update: Since I originally posted this question, I have worked out that Keane's approach is basically a clever fixed-point implementation of the following: return (uint32_t)(fmod (x * 256.0 / 255.0 + 0.5, 256.0) * (255.0 / 256.0));. This makes it a close relative of the next variant.
Henry S. Warren, Hacker's Delight 2nd ed., p. 272 shows a "multiply-shift-right" algorithm, presumably devised by the author himself, that is based on the mathematical property that n mod (2^k − 1) = ⌊(2^k / (2^k − 1)) · n⌋ mod 2^k. Fixed-point computation is used to multiply with the factor 2^k / (2^k − 1). I constructed two variants of this that differ in how they handle the mapping of a preliminary result of base−1 to 0. These are variants WARREN_MUL_SHR_1 and WARREN_MUL_SHR_2.
Are there algorithms for modulo-255 computation that are even more efficient than the three top contenders I have identified so far, in particular for platforms with slow integer multiplies? An efficient modification of Keane's multiplication-free algorithm for the summing of four base 256 digits would seem to be of particular interest in this context.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NAIVE_USING_DIV       (1)
#define DIGIT_SUM_THEN_FOLD   (2)
#define DIGIT_SUM_CARRY_OUT_1 (3)
#define DIGIT_SUM_CARRY_OUT_2 (4)
#define DIGIT_SUM_CARRY_OUT_3 (5)
#define KEANE_MAGIC           (6) // Joe Keane, comp.lang.c, 1995/07/09
#define WARREN_MUL_SHR_1      (7) // Hacker's Delight, 2nd ed., p. 272
#define WARREN_MUL_SHR_2      (8) // Hacker's Delight, 2nd ed., p. 272

#define VARIANT (WARREN_MUL_SHR_2)

uint32_t mod255 (uint32_t x)
{
#if VARIANT == NAIVE_USING_DIV
    return x - 255 * (x / 255);
#elif VARIANT == DIGIT_SUM_THEN_FOLD
    x = (x & 0xffff) + (x >> 16);
    x = (x & 0xff) + (x >> 8);
    x = (x & 0xff) + (x >> 8) + 1;
    x = (x & 0xff) + (x >> 8) - 1;
    return x;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_1
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    if (t == 255) t = 0;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_2
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x) + 1;
    t = (t & 0xff) + (t >> 8) - 1;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_3
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    t = t & ((t - 255) >> 8);
    return t;
#elif VARIANT == KEANE_MAGIC
    x = (((x >> 16) + x) >> 14) + (x << 2);
    x = ((x >> 8) + x + 2) & 0x3ff;
    x = (x - (x >> 8)) >> 2;
    return x;
#elif VARIANT == WARREN_MUL_SHR_1
    x = (0x01010101 * x + (x >> 8)) >> 24;
    x = x & ((x - 255) >> 8);
    return x;
#elif VARIANT == WARREN_MUL_SHR_2
    x = (0x01010101 * x + (x >> 8)) >> 24;
    if (x == 255) x = 0;
    return x;
#else
#error unknown VARIANT
#endif
}

uint32_t ref_mod255 (uint32_t x)
{
    volatile uint32_t t = x;
    t = t % 255;
    return t;
}

// timing with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
    double start, stop;
    uint32_t res, ref, x = 0;

    printf ("Testing VARIANT = %d\n", VARIANT);
    start = second();
    do {
        res = mod255 (x);
        ref = ref_mod255 (x);
        if (res != ref) {
            printf ("error # %08x: res=%08x ref=%08x\n", x, res, ref);
            return EXIT_FAILURE;
        }
        x++;
    } while (x);
    stop = second();
    printf ("test passed\n");
    printf ("elapsed = %.6f seconds\n", stop - start);
    return EXIT_SUCCESS;
}
For arbitrary unsigned integers, x and n, evaluating the modulo expression x % n involves (conceptually, at least) three operations: division, multiplication and subtraction:
quotient = x / n;
product = quotient * n;
modulus = x - product;
However, when n is a power of 2 (n = 2^p), the modulo can be determined much more rapidly, simply by masking out all but the lower p bits.
On most CPUs, addition, subtraction and bit-masking are very 'cheap' (rapid) operations, multiplication is more 'expensive', and division is very expensive – but note that most optimizing compilers will convert division by a compile-time constant into a multiplication (by a different constant) and a bit-shift (vide infra).
Thus, if we can convert our modulo 255 into a modulo 256, without too much overhead, we can likely speed up the process. We can do just this by noting that x % n is equivalent to (x + x / n) % (n + 1)†. For example, 1000 % 255 = 235, and (1000 + 1000 / 255) % 256 = 1003 % 256 = 235. Thus, our conceptual operations are now: division, addition and masking.
In the specific case of masking the lower 8 bits, x86/x64-based CPUs (and others?) will likely be able to perform a further optimization, as they can access 8-bit versions of (most) registers.
Here's what the clang-cl compiler generates for a naïve modulo 255 function (argument passed in ecx and returned in eax):
unsigned Naive255(unsigned x)
{
    return x % 255;
}
mov edx, ecx
mov eax, 2155905153 ;
imul rax, rdx ; Replacing the IDIV with IMUL and SHR
shr rax, 39 ;
mov edx, eax
shl edx, 8
sub eax, edx
add eax, ecx
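(If I'm reading the constant right, 2155905153 is 0x80808081, the rounded-up reciprocal 2^39/255, so the multiply and shift-right-by-39 compute x / 255 for any 32-bit x.)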
And here's the (clearly faster) code generated using the 'trick' described above:
unsigned Trick255(unsigned x)
{
    return (x + x / 255) & 0xFF;
}
mov eax, ecx
mov edx, 2155905153
imul rdx, rax
shr rdx, 39
add edx, ecx
movzx eax, dl ; Faster than an explicit AND mask?
Testing this code on a Windows-10 (64-bit) platform (Intel® Core™ i7-8550U CPU) shows that it significantly (but not hugely) out-performs the other algorithms presented in the question.
† The answer given by David Eisenstat explains how/why this equivalence is valid.
Here’s my sense of how the fastest answers work. I don’t know yet whether Keane can be improved or easily generalized.
Given an integer x ≥ 0, let q = ⌊x/255⌋ (in C, q = x / 255;) and r = x − 255 q (in C, r = x % 255;) so that q ≥ 0 and 0 ≤ r < 255 are integers and x = 255 q + r.
Adrian Mole’s method
This method evaluates (x + ⌊x/255⌋) mod 2^8 (in C, (x + x / 255) & 0xff), which equals (255q + r + q) mod 2^8 = (2^8 q + r) mod 2^8 = r.
Henry S. Warren’s method
Note that x + ⌊x/255⌋ = ⌊x + x/255⌋ = ⌊(2^8/255) x⌋, where the first step follows from x being an integer. This method uses the multiplier 2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 instead of 2^8/255, which is the sum of the infinite series 2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + …. Since the approximation is slightly under, this method must detect the residue 2^8 − 1 = 255.
Joe Keane’s method
The intuition for this method is to compute y = (2^8/255) x mod 2^8, which equals (2^8/255)(255q + r) mod 2^8 = (2^8 q + (2^8/255) r) mod 2^8 = (2^8/255) r, and return y − y/2^8, which equals r.
Since these formulas don't use the fact that ⌊(2^8/255) r⌋ = r, Keane can switch from 2^8 to 2^10 for two guard bits. Ideally, these would always be zero, but due to fixed-point truncation and an approximation for 2^10/255, they're not. Keane adds 2 to switch from truncation to rounding, which also avoids the special case in Warren.
This method sort of uses the multiplier 2^2 (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + 2^−40) = 2^2 (2^0 + 2^−16 + 2^−32) (2^0 + 2^−8). The C statement x = (((x >> 16) + x) >> 14) + (x << 2); computes x′ = ⌊2^2 (2^0 + 2^−16 + 2^−32) x⌋ mod 2^32. Then ((x >> 8) + x) & 0x3ff is x′′ = ⌊(2^0 + 2^−8) x′⌋ mod 2^10.
I don't have time right now to do the error analysis formally. Informally, the error interval of the first computation has width < 1; the second, width < 2 + 2^−8; the third, width < ((2 − 2^−8) + 1)/2^2 < 1, which allows correct rounding.
Regarding improvements, the 2^−40 term of the approximation seems not necessary (?), but we might as well have it unless we can drop the 2^−32 term. Dropping 2^−32 pushes the approximation quality out of spec.
Guess you're probably not looking for solutions that require fast 64-bit multiplication, but for the record:
return (x * 0x101010101010102ULL) >> 56;
This method (improved slightly since the previous edit) mashes up Warren and Keane. On my laptop, it’s faster than Keane but not as fast as a 64-bit multiply and shift. It avoids multiplication but benefits from a single rotate instruction. Unlike the original version, it’s probably OK on RISC-V.
Like Warren, this method approximates ⌊(256/255) x mod 256⌋ in 8.24 fixed point. Mod 256, each byte b contributes a term (256/255) b, which is approximately b.bbb… in base 256. The original version of this method just sums all four byte rotations. (I'll get to the revised version in a moment.) This sum always underestimates the real value, but by less than 4 units in the last place. By adding 4/2^24 before truncating, we guarantee the right answer as in Keane.
The revised version saves work by relaxing the approximation quality. We write (256/255) x = (257/256) (65536/65535) x, evaluate (65536/65535) x in 16.16 fixed point (i.e., add x to its 16-bit rotation), and then multiply by 257/256 and mod by 256 into 8.24 fixed point. The first multiplication has error less than 2 units in the last place of 16.16, and the second is exact (!). The sum underestimates by less than (2/2^16) (257/256), so a constant term of 514/2^24 suffices to fix the truncation. It's also possible to use a greater value in case a different immediate operand is more efficient.
uint32_t mod255(uint32_t x) {
    x += (x << 16) | (x >> 16);
    return ((x << 8) + x + 514) >> 24;
}
If we have a builtin or intrinsic that is optimised to a single add-with-carry instruction (Clang provides __builtin_addc, used below), one can use 32-bit arithmetic in the following way:
uint32_t carry = 0;
// sum up top and bottom 16 bits while generating carry out
x = __builtin_addc(x, x<<16, carry, &carry);
x &= 0xffff0000;
// store the previous carry to bit 0 while adding
// bits 16:23 over bits 24:31, and producing one more carry
x = __builtin_addc(x, x << 8, carry, &carry);
x = __builtin_addc(x, x >> 24, carry, &carry);
x &= 0x0000ffff; // actually 0x1ff is enough
// final correction for 0<=x<=257, i.e. min(x,x-255)
x = x < x-255 ? x : x - 255;
In Arm64, at least, the regular add instruction can take the form add r0, r1, r2, LSL 16; masking with an immediate or clearing consecutive bits is a single instruction, bfi r0, wzr, #start_bit, #length.
For parallel calculation one can't efficiently use widening multiplication. Instead one can divide and conquer while calculating carries: starting with 16 uint32_t elements interpreted as 16+16 uint16_t elements, then moving to uint8_t arithmetic, one can calculate one result in slightly less than one instruction:
auto a0 = vld2q_u16(ptr);      // split input to top16+bot16 halves (low halves in val[0])
auto a1 = vld2q_u16(ptr + 16); // load more inputs (each vld2q_u16 consumes 16 uint16_t)
auto b0 = vaddq_u16(a0.val[0], a0.val[1]);
auto b1 = vaddq_u16(a1.val[0], a1.val[1]);
auto c0 = vcltq_u16(b0, a0.val[1]); // 8 carries (all-ones lanes where the add wrapped)
auto c1 = vcltq_u16(b1, a1.val[1]); // 8 more carries
b0 = vsubq_u16(b0, c0); // subtracting all-ones adds the end-around carry
b1 = vsubq_u16(b1, c1);
auto d = vuzpq_u8(vreinterpretq_u8_u16(b0), vreinterpretq_u8_u16(b1)); // split into low/high bytes
auto result = vaddq_u8(d.val[0], d.val[1]);
auto carry = vcltq_u8(result, d.val[1]);
result = vsubq_u8(result, carry);
auto is_255 = vceqq_u8(result, vdupq_n_u8(255));
result = vbicq_u8(result, is_255); // map 255 -> 0

Divide a signed integer by a power of 2

I'm working on a way to divide a signed integer by a power of 2 using only binary operators (<< >> + ^ ~ & | !), with the result rounded toward 0. I came across a solution to this problem here on Stack Overflow; however, I cannot understand why it works. Here's the solution:
int divideByPowerOf2(int x, int n)
{
    return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}
I understand the x >> 31 part (only add the next part if x is negative, because if it's positive x will automatically be rounded toward 0). But what's bothering me is the (1 << n) + ~0 part. How can it work?
Assuming two's complement, just bit-shifting the dividend is equivalent to a certain kind of division: not the conventional division, where we round the dividend to the next multiple of the divisor toward zero, but another kind, where we round the dividend toward negative infinity. I rediscovered that in Smalltalk; see http://smallissimo.blogspot.fr/2015/03/is-bitshift-equivalent-to-division-in.html.
For example, let's divide -126 by 8. Traditionally, we would write
-126 = -15 * 8 - 6
But if we round toward negative infinity, we get a positive remainder and write it:
-126 = -16 * 8 + 2
The bit-shifting performs the second operation, in terms of bit patterns (assuming 8-bit-long int for the sake of being short):
1000|0010 >> 3 = 1111|0000
1000|0010 = 1111|0000 * 0000|1000 + 0000|0010
So what if we want the traditional division with quotient rounded toward zero and remainder of same sign as dividend? Simple, we just have to add 1 to the quotient - if and only if the dividend is negative and the division is inexact.
You saw that x >> 31 corresponds to the first condition, that the dividend is negative, assuming int has 32 bits.
The second term corresponds to the second condition, that the division is inexact.
See how -1, -2, -4, ... are encoded in two's complement: 1111|1111, 1111|1110, 1111|1100. So the negation of the nth power of two has n trailing zeros.
When the dividend has n trailing zeros and we divide by 2^n, then there is no need to add 1 to the final quotient. In any other case, we need to add 1.
What ((1 << n) + ~0) is doing is creating a mask with n trailing ones.
The n last bits don't really matter, because we are going to shift to the right and just throw them away. So, if the division is exact, the n trailing bits of the dividend are zero, and we just add n 1s that will be skipped. On the contrary, if the division is inexact, then one or more of the n trailing bits of the dividend is 1, and we are sure to cause a carry into the n+1 bit position: that's how we add 1 to the quotient (we add 2^n to the dividend). Does that explain it a bit more?
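To see the carry doing its work, here is a minimal check (assuming 32-bit int and an arithmetic right shift, as the rest of this discussion does):

#include <stdio.h>

int divideByPowerOf2(int x, int n)
{
    return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}

int main(void)
{
    printf("%d\n", divideByPowerOf2(-126, 3)); // -15: the carry bumps -16 up to -15
    printf("%d\n", divideByPowerOf2(-128, 3)); // -16: division is exact, no carry
    printf("%d\n", divideByPowerOf2( 126, 3)); //  15: mask is 0 for positive x
    return 0;
}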
This is "write-only code": instead of trying to understand the code, try to create it by yourself.
For example, let's divide a number by 8 (shift right by 3).
If the number is negative, the normal right-shift rounds in the wrong direction. Let's "fix" it by adding a number:
int divideBy8(int x)
{
    if (x >= 0)
        return x >> 3;
    else
        return (x + whatever) >> 3;
}
Here you can come up with a mathematical formula for whatever, or do some trial and error. Anyway, here whatever = 7:
int divideBy8(int x)
{
    if (x >= 0)
        return x >> 3;
    else
        return (x + 7) >> 3;
}
How to unify the two cases? You need to make an expression that looks like this:
(x + stuff) >> 3
where stuff is 7 for negative x, and 0 for positive x. The trick here is using x >> 31, which is a 32-bit number whose bits are equal to the sign-bit of x: all 0 or all 1. So stuff is
(x >> 31) & 7
Combining all these, and replacing 8 and 7 by the more general power of 2, you get the code you asked about.
Note: in the description above, I assume that int represents a 32-bit hardware register, and hardware uses two's complement representation to do right shift.
OP's reference is to C# code, and there are many subtle differences that make it bad code in C, which is how this post is tagged.
int is not necessarily 32 bits, so using a magic number tied to a 32-bit int does not make for a robust solution.
In particular, (1 << n) + ~0 results in undefined behavior when n causes a bit to be shifted into the sign place. Not good coding.
Restricting code to only using "binary" operators << >> + ^ ~ & | ! encourages a coder to assume things about int which are not portable nor compliant with the C spec. So OP's posted code does not "work" in general, although it may work in many common implementations.
OP's code fails when int is not two's complement, when int does not use the range [-2147483648 .. 2147483647], or when 1 << n uses implementation behavior that is not as expected.
// weak code
int divideByPowerOf2(int x, int n) {
    return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}
A simple alternative, assuming long long exceeds the range of int, follows. I doubt this meets some corner of OP's goals, but OP's stated goals encourage non-robust coding.
int divideByPowerOf2(int x, int n) {
    long long ill = x;
    if (x < 0) ill = -ill;
    while (n--) ill >>= 1;
    if (x < 0) ill = -ill;
    return (int) ill;
}

Moving a "nibble" to the left using C

I've been working on this puzzle for a while. I'm trying to figure out how to rotate 4 bits in a number (x) around to the left (with wrapping) by n, where 0 <= n <= 31. The code will look like:
int moveNib(int x, int n){
    //... some code here
}
The trick is that I can only use these operators:
~ & ^ | + << >>
and of them only a combination of 25. I also can not use If statements, loops, function calls. And I may only use type int.
An example would be moveNib(0x87654321,1) = 0x76543218.
My attempt: I have figured out how to use a mask to store the bits and all, but I can't figure out how to move by an arbitrary number. Any help would be appreciated, thank you!
How about:
uint32_t moveNib(uint32_t x, int n) { return x<<(n<<2) | x>>((8-n)<<2); }
It uses <<2 to convert from nibbles to bits, and then shifts the bits by that much. To handle wraparound, we OR with a copy of the number which has been shifted by the opposite amount in the opposite direction. For example, with x=0x87654321 and n=1, the left part is shifted 4 bits to the left and becomes 0x76543210, and the right part is shifted 28 bits to the right and becomes 0x00000008; ORed together, the result is 0x76543218, as requested.
Edit: If - really isn't allowed, then this will get the same result (assuming an architecture with two's complement integers) without using it:
uint32_t moveNib(uint32_t x, int n) { return x<<(n<<2) | x>>((9+~n)<<2); }
Edit2: OK. Since you aren't allowed to use anything but int, how about this, then?
int moveNib(int x, int n) { return (x&0xffffffff)<<(n<<2) | (x&0xffffffff)>>((9+~n)<<2); }
The logic is the same as before, but we force the calculation to use unsigned integers by ANDing with 0xffffffff. All this assumes 32 bit integers, though. Is there anything else I have missed now?
Edit3: Here's one more version, which should be a bit more portable:
int moveNib(int x, int n) { return ((x|0u)<<((n&7)<<2) | (x|0u)>>((9+~(n&7))<<2))&0xffffffff; }
It caps n as suggested by chux, and uses |0u to convert to unsigned in order to avoid the sign bit duplication you get with signed integers. This works because (from the standard):
Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.
Since int and 0u have the same rank, but 0u is unsigned, then the result is unsigned, even though ORing with 0 otherwise would be a null operation.
It then truncates the result to the range of a 32-bit int so that the function will still work if ints have more bits than this (though the rotation will still be performed on the lowest 32 bits in that case. A 64-bit version would replace 7 by 15, 9 by 17 and truncate using 0xffffffffffffffff).
This solution uses 12 operators (11 if you skip the truncation, 10 if you store n&7 in a variable).
To see what happens in detail here, let's go through it for the example you gave: x=0x87654321, n=1. x|0u results in a the unsigned number 0x87654321u. (n&7)<<2=4, so we will shift 4 bits to the left, while ((9+~(n&7))<<2=28, so we will shift 28 bits to the right. So putting this together, we will compute 0x87654321u<<4 | 0x87654321u >> 28. For 32-bit integers, this is 0x76543210|0x8=0x76543218. But for 64-bit integers it is 0x876543210|0x8=0x876543218, so in that case we need to truncate to 32 bits, which is what the final &0xffffffff does. If the integers are shorter than 32 bits, then this won't work, but your example in the question had 32 bits, so I assume the integer types are at least that long.
As a small side-note: If you allow one operator which is not on the list, the sizeof operator, then we can make a version that works with all the bits of a longer int automatically. Inspired by Aki, we get (using 16 operators (remember, sizeof is an operator in C)):
int moveNib(int x, int n) {
    int nbit = (n&((sizeof(int)<<1)+~0u))<<2;
    return (x|0u)<<nbit | (x|0u)>>((sizeof(int)<<3)+1u+~nbit);
}
Without the additional restrictions, the typical rotate-left operation (by 0 < n < 8 nibble positions) is trivial.
uint32_t X = (x << 4*n) | (x >> 4*(8-n));
Since we are talking about rotations, n < 0 is not a problem: rotation right by 1 is the same as rotation left by 7 units, i.e. nn = n & 7; and we are through.
int nn = (n & 7) << 2; // Remove the multiplication
uint32_t X = (x << nn) | (x >> (32-nn));
When nn == 0, x would be shifted by 32, which is undefined. This can be replaced simply with x >> 0, i.e. no rotation at all. (x << 0) | (x >> 0) == x.
Replacing the subtraction with addition: a - b = a + (~b+1) and simplifying:
int nn = (n & 7) << 2;
int mm = (33 + ~nn) & 31;
uint32_t X = (x << nn) | (x >> mm); // when nn=0, also mm=0
Now the only problem is in shifting a signed int x right, which would duplicate the sign bit. That is cured by a mask, (1 << nn) - 1, applied to the right-shifted part:
int nn = (n & 7) << 2;
int mm = (33 + ~nn) & 31;
int result = (x << nn) | ((x >> mm) & ((1 << nn) + ~0));
At this point we have used just 12 of the allowed operations -- next we can start to dig into the problem of sizeof(int)...
int nn = (n & (2*sizeof(int)-1)) << 2; // etc.

Finding trailing 0s in a binary number

How do I find the number of trailing 0s in a binary number? Based on the K&R bitcount example of finding 1s in a binary number, I modified it a bit to find the trailing 0s.
int bitcount(unsigned x)
{
    int b;
    for (b = 0; x != 0; x >>= 1)
    {
        if (x & 01)
            break;
        else
            b++;
    }
    return b;
}
I would like to review this method.
Here's a way to compute the count in parallel for better efficiency:
unsigned int v; // 32-bit word input to count zero bits on right
unsigned int c = 32; // c will be the number of zero bits on the right
v &= -v; // isolate the rightmost set bit (two's complement)
if (v) c--;
if (v & 0x0000FFFF) c -= 16;
if (v & 0x00FF00FF) c -= 8;
if (v & 0x0F0F0F0F) c -= 4;
if (v & 0x33333333) c -= 2;
if (v & 0x55555555) c -= 1;
On GCC on the x86 platform you can use __builtin_ctz(x).
On Microsoft compilers for x86 you can use _BitScanForward.
They both emit a bsf instruction.
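A small portability wrapper over the two (a hypothetical helper; the name ctz32 is mine):

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static int ctz32(uint32_t x) // precondition: x != 0
{
#if defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward(&idx, x);
    return (int)idx;
#else
    return __builtin_ctz(x);
#endif
}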
Another approach (I'm surprised it's not mentioned here) would be to build a table of 256 integers, where each element in the array is the lowest 1 bit for that index. Then, for each byte in the integer, you look up in the table.
Something like this (I haven't taken any time to tweak this, this is just to roughly illustrate the idea):
int bitcount(unsigned x)
{
    // table[b] = number of trailing zeros in byte b, for b != 0
    static const unsigned char table[256] = { /* TODO: populate with constants */ };
    for (unsigned i = 0; i < sizeof(x); ++i, x >>= 8)
    {
        if (x & 0xff)
            return table[x & 0xff] + i*8; // Found a 1...
    }
    // All zeroes...
    return sizeof(x)*8;
}
The idea with some of the table-driven approaches to a problem like this is that if statements cost you something in terms of branch prediction, so you should aim to reduce them. It also reduces the number of bit shifts. Your approach does an if statement and a shift per bit, and this one does one per byte. (Hopefully the optimizer can unroll the for loop, and not issue a compare/jump for that.) Some of the other answers have even fewer if statements than this, but a table approach is simple and easy to understand. Of course you should be guided by actual measurements to see if any of this matters.
I think your method is working (although you might want to use unsigned int). You check the last digit each time, and if it's zero, you discard it and increment the number of trailing zero-bits.
I think for trailing zeroes you don't need a loop.
Consider the following:
What happens with the number (in binary representation, of course) if you subtract 1? Which digits change, which stay the same?
How could you combine the original number and the decremented version such that only bits representing trailing zeroes are left?
If you apply the above steps correctly, you can just find the highest bit set in O(lg n) steps (look here if you're interested in how to do it).
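Spelling those hints out as a sketch (one possible combination; the final count could equally use the O(lg n) highest-bit method or the parallel method above):

#include <stdint.h>

int trailing_zeros(uint32_t x)
{
    if (x == 0) return 32;          // the whole word is trailing zeros
    uint32_t mask = ~x & (x - 1);   // ones exactly where x has trailing zeros
    int count = 0;
    while (mask) { count++; mask >>= 1; } // mask is contiguous ones from bit 0
    return count;
}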
Should be:
int bitcount(unsigned char x)
{
    int b;
    for (b = 0; b < 7; x >>= 1)
    {
        if (x & 1)
            break;
        else
            b++;
    }
    return b;
}
or even
int bitcount(unsigned char x)
{
    int b;
    for (b = 0; b < 7 && !(x & 1); x >>= 1) b++;
    return b;
}
or even (yay!)
int bitcount(unsigned char x)
{
    int b;
    for (b = 0; b < 7 && !(x & 1); b++) x >>= 1;
    return b;
}
or ...
Ah, whatever, there are 100500 million methods of doing this. Use whatever you need or like.
We can easily get it using bit operations, we don't need to go through all the bits. Pseudo code:
int bitcount(unsigned x) {
    unsigned mask = x ^ (x - 1); // this will have (1 + #trailing 0s) trailing 1s
    return log2(x & mask);       // x & mask has only one bit set, and its log2 gives the exact number of zeroes
}
int countTrailZero(unsigned x) {
    if (x == 0) return DEFAULT_VALUE_YOU_NEED;
    return log2(x & -x); // log2 from <math.h>; exact here, since x & -x is a power of two
}
Explanation:
x & -x isolates the rightmost set bit.
e.g. 6 -> "0000,0110", (6 & -6) -> "0000,0010"
You can deduce this from two's complement:
x = "a1b", where b represents all trailing zeros.
then
-x = ~x + 1 = ~(a1b) + 1 = (~a)0(~b) + 1 = (~a)0(1...1) + 1 = (~a)1(0...0) = (~a)1b
so
x & (-x) = (a1b) & ((~a)1b) = (0...0)1(0...0)
you can get the number of trailing zeros just by doing log2.
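If you'd rather avoid floating point, the log2 step can be done with the classic De Bruijn multiply-and-lookup from the Bit Twiddling Hacks collection (shown here as a sketch; v must be a power of two, which x & -x always is for nonzero x):

static const int MultiplyDeBruijnBitPosition[32] = {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

int log2_of_pow2(unsigned v) // v = x & -x, a power of two
{
    return MultiplyDeBruijnBitPosition[(v * 0x077CB531u) >> 27];
}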

mirror bits of a 32 bit word

How would you do that in C? (Example: 10110001 becomes 10001101 if we had to mirror 8 bits). Are there any instructions on certain processors that would simplify this task?
It's actually called "bit reversal", and is commonly done in FFT scrambling. The O(log N) way is (for up to 32 bits):
uint32_t reverse(uint32_t x, int bits)
{
x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1); // Swap _<>_
x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2); // Swap __<>__
x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4); // Swap ____<>____
x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8); // Swap ...
x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16); // Swap ...
return x >> (32 - bits);
}
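For example, reverse(0xB, 4) returns 0xD (1011 mirrored in 4 bits is 1101), and reverse(x, 32) mirrors the full word.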
Maybe this small "visualization" helps:
An example of the first 3 assignment, with a uint8_t example:
b7 b6 b5 b4 b3 b2 b1 b0
-> <- -> <- -> <- -> <-
----> <---- ----> <----
----------> <----------
Well, if we're doing ASCII art, here's mine:
7 6 5 4 3 2 1 0
X X X X
6 7 4 5 2 3 0 1
\ X / \ X /
X X X X
/ X \ / X \
4 5 6 7 0 1 2 3
\ \ \ X / / /
\ \ X X / /
\ X X X /
X X X X
/ X X X \
/ / X X \ \
/ / / X \ \ \
0 1 2 3 4 5 6 7
It kind of looks like FFT butterflies. Which is why it pops up with FFTs.
Per Rich Schroeppel in this MIT memo (if you can read past the assembler), the following will reverse the bits in an 8-bit byte, providing that you have 64-bit arithmetic available:
byte = (byte * 0x0202020202ULL & 0x010884422010ULL) % 1023;
Which sort of fans the bits out (the multiply), selects them (the AND) and then shrinks them back down (the modulus).
Is it actually an 8-bit quantity that you have?
Nearly a duplicate of Most Efficient Algorithm for Bit Reversal ( from MSB->LSB to LSB->MSB) in C (which has a lot of answers, including one AVX2 answer for reversing every 8-bit char in an array).
X86
On x86 with SSSE3 (Core2 and later, Bulldozer and later), pshufb (_mm_shuffle_epi8) can be used as a nibble LUT to do 16 lookups in parallel. You only need 8 lookups for the 8 nibbles in a single 32-bit integer, but the real problem is splitting the input bytes into separate nibbles (with their upper half zeroed). It's basically the same problem as for pshufb-based popcount.
avx2 register bits reverse shows how to do this for a packed vector of 32-bit elements. The same code ported to 128-bit vectors would compile just fine with AVX.
It's still good for a single 32-bit int because x86 has very efficient round-trip between integer and vector regs: int bitrev = _mm_cvtsi128_si32 ( rbit32( _mm_cvtsi32_si128(input) ) );. That only costs 2 extra movd instructions to get an integer from an integer register into XMM and back. (Round trip latency = 3 cycles on an Intel CPU like Haswell.)
ARM:
rbit has single-cycle latency, and does a whole 32-bit integer in one instruction.
The fastest approach is almost sure to be a lookup table:
// in and out are the four bytes of the input/output words;
// lut holds the bit-reverse of every byte value
out[0] = lut[in[3]];
out[1] = lut[in[2]];
out[2] = lut[in[1]];
out[3] = lut[in[0]];
Or if you can afford 128K of table data (by afford, I mean CPU cache utilization, not main memory or virtual memory utilization), use 16-bit units:
out[0] = lut[in[1]];
out[1] = lut[in[0]];
The naive / slow / simple way is to extract the low bit of the input and shift it into another variable that accumulates a return value.
#include <stdint.h>
uint32_t mirror_u32(uint32_t input) {
uint32_t returnval = 0;
for (int i = 0; i < 32; ++i) {
int bit = input & 0x01;
returnval <<= 1;
returnval += bit; // Shift the isolated bit into returnval
input >>= 1;
}
return returnval;
}
For other types, the number of bits of storage is sizeof(input) * CHAR_BIT, but that includes potential padding bits that aren't part of the value. The fixed-width types are a good idea here.
The += instead of |= makes gcc compile it more efficiently for x86 (using x86's shift-and-add instruction, LEA). Of course, there are much faster ways to bit-reverse; see the other answers. This loop is good for small code size (no large masks), but otherwise pretty much no advantage.
Compilers unfortunately don't recognize this loop as a bit-reverse and optimize it to ARM rbit or whatever. (See it on the Godbolt compiler explorer)
If you are interested in a more embedded approach, when I worked with an armv7a system, I found the RBIT command.
So within a C function using GNU extended asm I could use:
uint32_t bit_reverse32(uint32_t inp32)
{
uint32_t out = 0;
asm("RBIT %0, %1" : "=r" (out) : "r" (inp32));
return out;
}
There are compilers which expose intrinsic C wrappers like this (armcc has __rbit), and gcc also has some intrinsics via ACLE, but with gcc-arm-linux-gnueabihf I could not find a __rbit C intrinsic, so I came up with the code above.
I didn't look, but I suppose on other platforms you could create similar solutions.
I've also just figured out a minimal solution for mirroring 4 bits (a nibble) in only 16 bits of temporary space:
mirr = ( (orig * 0x222) & 0x1284 ) % 63
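Each nibble bit lands on its own selected bit position before the fold (bit 0 → residue 8, bit 1 → 4, bit 2 → 2, bit 3 → 1, summing without carries), so it holds for all 16 values. A quick test:

#include <stdio.h>

int main(void)
{
    for (unsigned orig = 0; orig < 16; orig++) {
        unsigned mirr = ((orig * 0x222) & 0x1284) % 63;
        unsigned expect = ((orig & 1) << 3) | ((orig & 2) << 1)
                        | ((orig & 4) >> 1) | ((orig & 8) >> 3);
        printf("%x -> %x %s\n", orig, mirr, mirr == expect ? "ok" : "FAIL");
    }
    return 0;
}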
I think I would make a lookup table of bitpatterns 0-255. Read each byte and with the lookup table reverse that byte and afterwards arrange the resulting bytes appropriately.
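A sketch of that idea (rev8 is a hypothetical 256-entry table holding the bit-reverse of each byte value; its contents are elided here):

#include <stdint.h>

extern const uint8_t rev8[256]; // rev8[b] = bit-reversed byte b

uint32_t mirror32(uint32_t x)
{
    return ((uint32_t)rev8[ x        & 0xff] << 24)
         | ((uint32_t)rev8[(x >>  8) & 0xff] << 16)
         | ((uint32_t)rev8[(x >> 16) & 0xff] <<  8)
         |  (uint32_t)rev8[ x >> 24        ];
}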
quint64 mirror(quint64 a, quint8 l = 64) {
    quint64 b = 0;
    for (quint8 i = 0; i < l; i++) {
        b |= ((a >> (l - i - 1)) & 1) << i; // bit (l-1-i) of a becomes bit i of b
    }
    return b;
}
This function can mirror fewer than 64 bits; for instance, it can mirror 12 bits.
quint64 and quint8 are defined in Qt, but they can be redefined in any way you like.
If you have been staring at Mike DeSimone's great answer (like me), here is a "visualization" on the first 3 assignment, with a uint8_t example:
b7 b6 b5 b4 b3 b2 b1 b0
-> <- -> <- -> <- -> <-
----> <---- ----> <----
----------> <----------
So first, bitwise swap, then "two-bit-group" swap and so on.
For sure most people won't consider my approach either elegant or efficient: it's aimed at being portable and somehow "straightforward".
#include <limits.h> // CHAR_BIT

unsigned bit_reverse( unsigned s ) {
    unsigned d;
    int i;
    for( i = CHAR_BIT*sizeof( unsigned ), d = 0; i; s >>= 1, i-- ) {
        d <<= 1;
        d |= s & 1;
    }
    return d;
}
This function pulls the least significant bit from the source bitstring s and pushes it as the most significant bit in the destination bitstring d.
You can replace the unsigned data type with whatever suits your case, from unsigned char (CHAR_BIT bits, usually 8) to unsigned long long (usually 64 bits in modern 64-bit CPUs).
Of course, there can be CPU-specific instructions (or instruction sets) that could be used instead of my plain C code.
But then that wouldn't be "C language" but rather assembly instruction(s) in a C wrapper.
int mirror (int input)
{   // return bit mirror of 8 digit number
    int tmp2;
    int out = 0;
    for (int i = 0; i < 8; i++)
    {
        out = out << 1;
        tmp2 = input & 0x01;
        out = out | tmp2;
        input = input >> 1;
    }
    return out;
}
