How can I change mod with bitwise AND [duplicate] - c

Is it possible to divide an unsigned integer by 10 by using pure bit shifts, addition, subtraction and maybe multiply? Using a processor with very limited resources and slow divide.

Editor's note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 not 107374182. It is exact for smaller inputs, though, which may be sufficient for some uses.
Compilers (including MSVC) do use fixed-point multiplicative inverses for constant divisors, but they use a different magic constant and shift on the high-half result to get an exact result for all possible inputs, matching what the C abstract machine requires. See Granlund & Montgomery's paper on the algorithm.
See Why does GCC use multiplication by a strange number in implementing integer division? for examples of the actual x86 asm gcc, clang, MSVC, ICC, and other modern compilers make.
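For reference, a minimal sketch of that exact compiler-style division (the constant 0xCCCCCCCD and the total shift of 35 are what gcc and clang typically emit for unsigned division by 10; this sketch is an illustration, not part of the answer below):
#include <stdint.h>

/* Exact unsigned divide-by-10 via a multiplicative inverse:
   0xCCCCCCCD = ceil(2^35 / 10), so the high-half multiply plus shift is exact
   for every 32-bit input. */
uint32_t div10_exact(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}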
This is a fast approximation that's inexact for large inputs
It's even faster than the exact division via multiply + right-shift that compilers use.
You can use the high half of a multiply result for divisions by small integral constants. Assume a 32-bit machine (code can be adjusted accordingly):
int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}
What's going on here is that we're multiplying by a close approximation of 1/10 * 2^32 and then removing the 2^32. This approach can be adapted to different divisors and different bit widths.
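As a rough illustration of where the constant comes from (my sketch, not part of the original answer): the magic value is just 2^32/10 rounded up, and the same recipe can be tried for other small divisors and widths:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t magic = ((uint64_t)1 << 32) / 10 + 1;   /* 0x1999999A, i.e. ceil(2^32 / 10) */
    uint32_t x = 12345;
    uint32_t q = (uint32_t)((magic * x) >> 32);      /* high half of the product: 1234 */
    printf("magic = 0x%llX, %u / 10 ~= %u\n",
           (unsigned long long)magic, (unsigned)x, (unsigned)q);
    return 0;
}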
This works great for the ia32 architecture, since its IMUL instruction will put the 64-bit product into edx:eax, and the edx value will be the wanted value. Viz (assuming dividend is passed in eax and quotient returned in eax)
div10 proc
    mov  edx, 1999999Ah   ; load 1/10 * 2^32
    imul eax              ; edx:eax = dividend / 10 * 2^32
    mov  eax, edx         ; eax = dividend / 10
    ret
endp
Even on a machine with a slow multiply instruction, this will be faster than a software or even hardware divide.

Though the answers given so far match the actual question, they do not match the title. So here's a solution heavily inspired by Hacker's Delight that really uses only bit shifts.
unsigned divu10(unsigned n) {
    unsigned q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}
I think that this is the best solution for architectures that lack a multiply instruction.
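As a quick sanity check (my test scaffolding, assuming the divu10 above is in scope in the same file), you can compare it against the compiler's division over a range of inputs:
#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* spot-check divu10 against the built-in division */
    for (unsigned n = 0; n < 1000000u; n++)
        assert(divu10(n) == n / 10);
    puts("divu10 matches n / 10 for 0..999999");
    return 0;
}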

Of course you can, if you can live with some loss of precision. If you know the value range of your input values, you can come up with a bit shift and a multiplication that is exact.
Here are some examples of how you can divide by 10, 60, ..., as described in this blog, to format time the fastest way possible.
temp = (ms * 205) >> 11; // 205/2048 is nearly the same as /10

To expand on Alois's answer a bit, we can extend the suggested y = (x * 205) >> 11 to a few more multipliers/shifts:
y = (ms * 1) >> 3 // first error 8
y = (ms * 2) >> 4 // 8
y = (ms * 4) >> 5 // 8
y = (ms * 7) >> 6 // 19
y = (ms * 13) >> 7 // 69
y = (ms * 26) >> 8 // 69
y = (ms * 52) >> 9 // 69
y = (ms * 103) >> 10 // 179
y = (ms * 205) >> 11 // 1029
y = (ms * 410) >> 12 // 1029
y = (ms * 820) >> 13 // 1029
y = (ms * 1639) >> 14 // 2739
y = (ms * 3277) >> 15 // 16389
y = (ms * 6554) >> 16 // 16389
y = (ms * 13108) >> 17 // 16389
y = (ms * 26215) >> 18 // 43699
y = (ms * 52429) >> 19 // 262149
y = (ms * 104858) >> 20 // 262149
y = (ms * 209716) >> 21 // 262149
y = (ms * 419431) >> 22 // 699059
y = (ms * 838861) >> 23 // 4194309
y = (ms * 1677722) >> 24 // 4194309
y = (ms * 3355444) >> 25 // 4194309
y = (ms * 6710887) >> 26 // 11184819
y = (ms * 13421773) >> 27 // 67108869
Each line is a single, independent calculation, and you'll see your first "error"/incorrect result at the value shown in the comment. You're generally better off taking the smallest shift for a given error value, as this minimises the extra bits needed to store the intermediate value in the calculation; e.g. (x * 13) >> 7 is "better" than (x * 52) >> 9 as it needs two fewer bits of overhead, while both start to give wrong answers above 68.
If you want to calculate more of these, the following (Python) code can be used:
def mul_from_shift(shift):
    mid = 2**shift + 5.
    return int(round(mid / 10.))
And I did the obvious thing for calculating when this approximation starts to go wrong:
def first_err(mul, shift):
    i = 1
    while True:
        y = (i * mul) >> shift
        if y != i // 10:
            return i
        i += 1
(note that // is used for "integer" division, i.e. it truncates/rounds towards zero)
The reason for the "3/1" pattern in errors (i.e. 8 repeats 3 times followed by 19) seems to be due to the change of base, i.e. log2(10) is ~3.32. If we plot the errors we get the following (plot not reproduced here), where the relative error is given by: mul_from_shift(shift) / (1<<shift) - 0.1

Considering Kuba Ober's response, there is another one in the same vein.
It uses iterative approximation of the result, but I wouldn't expect any surprising performance.
Let's say we have to find x where x = v / 10.
We'll use the inverse operation v = x * 10, because it has the nice property that when x = a + b, then x * 10 = a * 10 + b * 10.
Let's use x as the variable holding the best approximation of the result so far. When the search ends, x will hold the result. We'll set each bit b of x from the most significant to the least significant, one by one, and compare (x + b) * 10 with v. If it is smaller than or equal to v, then the bit b is set in x. To test the next bit, we simply shift b one position to the right (divide by two).
We can avoid the multiplication by 10 by holding x * 10 and b * 10 in other variables.
This yields the following algorithm to divide v by 10.
uint16_t x = 0, x10 = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
    uint32_t t = x10 + b10;  /* use a type wider than 16 bits so the sum cannot wrap */
    if (t <= v) {
        x10 = (uint16_t) t;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = v / 10
Edit: to get the algorithm of Kuba Ober, which avoids the need for the variable x10, we can subtract b10 directly from v instead. In this case x10 isn't needed anymore. The algorithm becomes
uint16_t x = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
    if (b10 <= v) {
        v -= b10;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = v / 10
The loop may be unrolled, and the different values of b and b10 may be precomputed as constants, as in the sketch below.
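A possible shape of that unrolled version (my sketch; the constants are just the successive values of b10 and b from the loop above):
#include <stdint.h>

uint16_t div10_unrolled(uint16_t v)
{
    uint16_t x = 0;
    if (0xA000 <= v) { v -= 0xA000; x |= 0x1000; }
    if (0x5000 <= v) { v -= 0x5000; x |= 0x0800; }
    if (0x2800 <= v) { v -= 0x2800; x |= 0x0400; }
    if (0x1400 <= v) { v -= 0x1400; x |= 0x0200; }
    if (0x0A00 <= v) { v -= 0x0A00; x |= 0x0100; }
    if (0x0500 <= v) { v -= 0x0500; x |= 0x0080; }
    if (0x0280 <= v) { v -= 0x0280; x |= 0x0040; }
    if (0x0140 <= v) { v -= 0x0140; x |= 0x0020; }
    if (0x00A0 <= v) { v -= 0x00A0; x |= 0x0010; }
    if (0x0050 <= v) { v -= 0x0050; x |= 0x0008; }
    if (0x0028 <= v) { v -= 0x0028; x |= 0x0004; }
    if (0x0014 <= v) { v -= 0x0014; x |= 0x0002; }
    if (0x000A <= v) { v -= 0x000A; x |= 0x0001; }
    return x;   /* v now holds the remainder v % 10 */
}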

On architectures that can only shift one place at a time, a series of explicit comparisons against decreasing powers of two multiplied by 10 might work better than the solution from Hacker's Delight. Assuming a 16-bit dividend:
uint16_t div10(uint16_t dividend) {
    uint16_t quotient = 0;
#define div10_step(n) \
    do { if (dividend >= (n*10)) { quotient += n; dividend -= n*10; } } while (0)
    div10_step(0x1000);
    div10_step(0x0800);
    div10_step(0x0400);
    div10_step(0x0200);
    div10_step(0x0100);
    div10_step(0x0080);
    div10_step(0x0040);
    div10_step(0x0020);
    div10_step(0x0010);
    div10_step(0x0008);
    div10_step(0x0004);
    div10_step(0x0002);
    div10_step(0x0001);
#undef div10_step
    if (dividend >= 5) ++quotient; // round the result (optional)
    return quotient;
}

Well, division is repeated subtraction, so yes. Shift right by 1 (divide by 2). Now subtract 5 from the result, counting the number of times you do the subtraction until the value is less than 5. The result is the number of subtractions you did. Oh, and dividing is probably going to be faster.
A hybrid strategy of shift right then divide by 5 using the normal division might get you a performance improvement if the logic in the divider doesn't already do this for you.
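A minimal sketch of that hybrid idea (the function name is mine); it relies on the identity x / 10 == (x >> 1) / 5 for unsigned x and leaves the divide-by-5 to whatever the compiler or hardware does best:
unsigned div10_hybrid(unsigned x)
{
    return (x >> 1) / 5;   /* halve with a shift, then divide the result by 5 */
}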

I've designed a new method in AVR assembly, with lsr/ror and sub/sbc only. It divides by 8, then subtracts the number divided by 64 and 128, then subtracts the 1,024th and the 2,048th parts, and so on and so on. It works very reliably (including exact rounding) and quickly (370 microseconds at 1 MHz).
The source code is here for 16-bit-numbers:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/div10_16rd.asm
The page that comments this source code is here:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/DIV10.html
I hope that it helps, even though the question is ten years old.
brgs, gsc

elemakil's comments' code can be found here: https://doc.lagout.org/security/Hackers%20Delight.pdf
page 233. "Unsigned divide by 10 [and 11.]"

Related

Can we use bitwise operator for conversion from decimal to other bases other than 4, 8 , 16 and so on? In C

Can we use bitwise operator for conversion from decimal to other bases other than 4, 8, 16 and so on?
I understand how to do that for 4, 8, 16 and so on.
But for conversion from decimal to base 3, or base 12, for example, I don't know.
Is it possible?
I assume in your question you meant conversion from binary to other bases.
All arithmetic operations can be reduced to bitwise operations and shifts. That's what the CPU is doing internally in hardware too.
a + b ==> (a ^ b) + ((a & b) << 1)
The right side still has a + in there so you have to apply the same transformation again and again till you have a left shift larger than the width of your integer type. Or do it bit by bit in a loop.
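A small sketch of that idea done bit by bit in a loop (the helper name is mine), using only ^, & and <<:
unsigned bitwise_add(unsigned a, unsigned b)
{
    while (b != 0) {            /* repeat until no carries are left */
        unsigned carry = a & b; /* positions that generate a carry */
        a = a ^ b;              /* sum without the carries */
        b = carry << 1;         /* carries move one position to the left */
    }
    return a;
}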
With two's-complement:
-a ==> ~a + 1
And if you have + and negate you have -. * is just a bunch of shifts and adds. / is a bunch of shifts and subtract. Just consider how you did multiplication and long division in school and bring that down to base 2.
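For example, schoolbook multiplication in base 2 becomes a shift-and-add loop (a sketch with my own helper name; the + could itself be replaced by the carry trick above):
unsigned bitwise_mul(unsigned a, unsigned b)
{
    unsigned product = 0;
    while (b != 0) {
        if (b & 1)                 /* this bit of b contributes a shifted copy of a */
            product = product + a;
        a <<= 1;                   /* the next bit of b is worth twice as much */
        b >>= 1;
    }
    return product;
}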
For most bases, doing the math with bitwise operations is insane, especially if you derive your code from the basic operations above. The CPU's add, sub and mul operations are just fine and way faster. But if you want to implement printf() for a freestanding environment (like a kernel), you might need to do a division of uint64_t / 10 that your CPU can't do in hardware. The compiler (gcc, clang) also isn't smart enough to do this well and falls back to a general iterative uint64_t / uint64_t long division algorithm.
But a division can be done by multiplying by the inverse shifted up a few bits and then shifting the result back down. This method works out really well for a division by 10 and you get nicely optimized code:
uint64_t divu10(uint64_t n) {
    uint64_t q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q + (q >> 32);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}
That is shorter and faster by an order of magnitude or two than the general uint64_t / uint64_t long division function that gcc / clang will call when you write x / 10.
Note: (((q << 2) + q) << 1) is q * 10, another bitwise computation that is faster than q * 10 when the CPU doesn't have 64-bit integers.

Efficient modulo-255 computation

I am trying to find the most efficient way to compute modulo 255 of an 32-bit unsigned integer. My primary focus is to find an algorithm that works well across x86 and ARM platforms with an eye towards applicability beyond that. To first order, I am trying to avoid memory operations (which could be expensive), so I am looking for bit-twiddly approaches while avoiding tables. I am also trying to avoid potentially expensive operations such as branches and multiplies, and minimize the number of operations and registers used.
The ISO-C99 code below captures the eight variants I tried so far. It includes a framework for exhaustive test. I bolted onto this some crude execution time measurement which seems to work well enough to get a first performance impression. On the few platforms I tried (all with fast integer multiplies) the variants WARREN_MUL_SHR_2, WARREN_MUL_SHR_1, and DIGIT_SUM_CARRY_OUT_1 seem to be the most performant. My experiments show that the x86, ARM, PowerPC and MIPS compilers I tried at Compiler Explorer all make very good use of platform-specific features such as three-input LEA, byte-expansion instructions, multiply-accumulate, and instruction predication.
The variant NAIVE_USING_DIV uses an integer division, back-multiply with the divisor followed by subtraction. This is the baseline case. Modern compilers know how to efficiently implement the unsigned integer division by 255 (via multiplication) and will use a discrete replacement for the backmultiply where appropriate. To compute modulo base-1 one can sum base digits, then fold the result. For example 3334 mod 9: sum 3+3+3+4 = 13, fold 1+3 = 4. If the result after folding is base-1, we need to generate 0 instead. DIGIT_SUM_THEN_FOLD uses this method.
A. Cockburn, "Efficient implementation of the OSI transport protocol checksum algorithm using 8/16-bit arithmetic", ACM SIGCOMM Computer Communication Review, Vol. 17, No. 3, July/Aug. 1987, pp. 13-20
showed a different way of adding digits modulo base-1 efficiently in the context of a checksum computation modulo 255. Compute a byte-wise sum of the digits, and after each addition, add any carry-out from the addition as well. So this would be an ADD a, b, ADC a, 0 sequence. Writing out the addition chain for this using base 256 digits it becomes clear that the computation is basically a multiply with 0x0101 ... 0101. The result will be in the most significant digit position, except that one needs to capture the carry-out from the addition in that position separately. This method only works when a base digit comprises 2^k bits. Here we have k = 3. I tried three different ways of remapping a result of base-1 to 0, resulting in variants DIGIT_SUM_CARRY_OUT_1, DIGIT_SUM_CARRY_OUT_2, DIGIT_SUM_CARRY_OUT_3.
An intriguing approach to computing modulo-63 efficiently was demonstrated by Joe Keane in the newsgroup comp.lang.c on 1995/07/09. While thread participant Peter L. Montgomery proved the algorithm correct, unfortunately Mr. Keane did not respond to requests to explain its derivation. This algorithm is also reproduced in H. Warren's Hacker's Delight 2nd ed. I was able to extend it, in purely mechanical fashion, to modulo-127 and modulo-255. This is the (appropriately named) KEANE_MAGIC variant. Update: Since I originally posted this question, I have worked out that Keane's approach is basically a clever fixed-point implementation of the following: return (uint32_t)(fmod (x * 256.0 / 255.0 + 0.5, 256.0) * (255.0 / 256.0));. This makes it a close relative of the next variant.
Henry S. Warren, Hacker's Delight 2nd ed., p. 272 shows a "multiply-shift-right" algorithm, presumably devised by the author themself, that is based on the mathematical property that n mod (2^k - 1) = floor (2^k / (2^k - 1) * n) mod 2^k. Fixed-point computation is used to multiply with the factor 2^k / (2^k - 1). I constructed two variants of this that differ in how they handle the mapping of a preliminary result of base-1 to 0. These are variants WARREN_MUL_SHR_1 and WARREN_MUL_SHR_2.
Are there algorithms for modulo-255 computation that are even more efficient than the three top contenders I have identified so far, in particular for platforms with slow integer multiplies? An efficient modification of Keane's multiplication-free algorithm for the summing of four base 256 digits would seem to be of particular interest in this context.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define NAIVE_USING_DIV (1)
#define DIGIT_SUM_THEN_FOLD (2)
#define DIGIT_SUM_CARRY_OUT_1 (3)
#define DIGIT_SUM_CARRY_OUT_2 (4)
#define DIGIT_SUM_CARRY_OUT_3 (5)
#define KEANE_MAGIC (6) // Joe Keane, comp.lang.c, 1995/07/09
#define WARREN_MUL_SHR_1 (7) // Hacker's Delight, 2nd ed., p. 272
#define WARREN_MUL_SHR_2 (8) // Hacker's Delight, 2nd ed., p. 272
#define VARIANT (WARREN_MUL_SHR_2)
uint32_t mod255 (uint32_t x)
{
#if VARIANT == NAIVE_USING_DIV
return x - 255 * (x / 255);
#elif VARIANT == DIGIT_SUM_THEN_FOLD
x = (x & 0xffff) + (x >> 16);
x = (x & 0xff) + (x >> 8);
x = (x & 0xff) + (x >> 8) + 1;
x = (x & 0xff) + (x >> 8) - 1;
return x;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_1
uint32_t t;
t = 0x01010101 * x;
t = (t >> 24) + (t < x);
if (t == 255) t = 0;
return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_2
uint32_t t;
t = 0x01010101 * x;
t = (t >> 24) + (t < x) + 1;
t = (t & 0xff) + (t >> 8) - 1;
return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_3
uint32_t t;
t = 0x01010101 * x;
t = (t >> 24) + (t < x);
t = t & ((t - 255) >> 8);
return t;
#elif VARIANT == KEANE_MAGIC
x = (((x >> 16) + x) >> 14) + (x << 2);
x = ((x >> 8) + x + 2) & 0x3ff;
x = (x - (x >> 8)) >> 2;
return x;
#elif VARIANT == WARREN_MUL_SHR_1
x = (0x01010101 * x + (x >> 8)) >> 24;
x = x & ((x - 255) >> 8);
return x;
#elif VARIANT == WARREN_MUL_SHR_2
x = (0x01010101 * x + (x >> 8)) >> 24;
if (x == 255) x = 0;
return x;
#else
#error unknown VARIANT
#endif
}
uint32_t ref_mod255 (uint32_t x)
{
volatile uint32_t t = x;
t = t % 255;
return t;
}
// timing with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
int main (void)
{
double start, stop;
uint32_t res, ref, x = 0;
printf ("Testing VARIANT = %d\n", VARIANT);
start = second();
do {
res = mod255 (x);
ref = ref_mod255 (x);
if (res != ref) {
printf ("error # %08x: res=%08x ref=%08x\n", x, res, ref);
return EXIT_FAILURE;
}
x++;
} while (x);
stop = second();
printf ("test passed\n");
printf ("elapsed = %.6f seconds\n", stop - start);
return EXIT_SUCCESS;
}
For arbitrary unsigned integers, x and n, evaluating the modulo expression x % n involves (conceptually, at least), three operations: division, multiplication and subtraction:
quotient = x / n;
product = quotient * n;
modulus = x - product;
However, when n is a power of 2 (n = 2^p), the modulo can be determined much more rapidly, simply by masking out all but the lower p bits.
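In code, the power-of-two case is just a mask (a trivial sketch; the function name is mine):
unsigned mod_pow2(unsigned x, unsigned p)
{
    return x & ((1u << p) - 1u);   /* x % 2^p, keeping only the low p bits */
}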
On most CPUs, addition, subtraction and bit-masking are very 'cheap' (rapid) operations, multiplication is more 'expensive' and division is very expensive – but note that most optimizing compilers will convert division by a compile-time constant into a multiplication (by a different constant) and a bit-shift (vide infra).
Thus, if we can convert our modulo 255 into a modulo 256, without too much overhead, we can likely speed up the process. We can do just this by noting that x % n is equivalent to (x + x / n) % (n + 1)†. Thus, our conceptual operations are now: division, addition and masking.
In the specific case of masking the lower 8 bits, x86/x64-based CPUs (and others?) will likely be able to perform a further optimization, as they can access 8-bit versions of (most) registers.
Here's what the clang-cl compiler generates for a naïve modulo 255 function (argument passed in ecx and returned in eax):
unsigned Naive255(unsigned x)
{
return x % 255;
}
mov edx, ecx
mov eax, 2155905153 ;
imul rax, rdx ; Replacing the IDIV with IMUL and SHR
shr rax, 39 ;
mov edx, eax
shl edx, 8
sub eax, edx
add eax, ecx
And here's the (clearly faster) code generated using the 'trick' described above:
unsigned Trick255(unsigned x)
{
return (x + x / 255) & 0xFF;
}
mov eax, ecx
mov edx, 2155905153
imul rdx, rax
shr rdx, 39
add edx, ecx
movzx eax, dl ; Faster than an explicit AND mask?
Testing this code on a Windows-10 (64-bit) platform (Intel® Core™ i7-8550U CPU) shows that it significantly (but not hugely) out-performs the other algorithms presented in the question.
† The answer given by David Eisenstat explains how/why this equivalence is valid.
Here’s my sense of how the fastest answers work. I don’t know yet whether Keane can be improved or easily generalized.
Given an integer x ≥ 0, let q = ⌊x/255⌋ (in C, q = x / 255;) and r = x − 255 q (in C, r = x % 255;) so that q ≥ 0 and 0 ≤ r < 255 are integers and x = 255 q + r.
Adrian Mole’s method
This method evaluates (x + ⌊x/255⌋) mod 2^8 (in C, (x + x / 255) & 0xff), which equals (255 q + r + q) mod 2^8 = (2^8 q + r) mod 2^8 = r.
Henry S. Warren’s method
Note that x + ⌊x/255⌋ = ⌊x + x/255⌋ = ⌊(2^8/255) x⌋, where the first step follows from x being an integer. This method uses the multiplier (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32) instead of 2^8/255, which is the sum of the infinite series 2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + …. Since the approximation is slightly under, this method must detect the residue 2^8 − 1 = 255.
Joe Keane’s method
The intuition for this method is to compute y = (2^8/255) x mod 2^8, which equals (2^8/255) (255 q + r) mod 2^8 = (2^8 q + (2^8/255) r) mod 2^8 = (2^8/255) r, and return y − y/2^8, which equals r.
Since these formulas don’t use the fact that ⌊(2^8/255) r⌋ = r, Keane can switch from 2^8 to 2^10 for two guard bits. Ideally, these would always be zero, but due to fixed-point truncation and an approximation for 2^10/255, they’re not. Keane adds 2 to switch from truncation to rounding, which also avoids the special case in Warren.
This method sort of uses the multiplier 2^2 (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + 2^−40) = 2^2 (2^0 + 2^−16 + 2^−32) (2^0 + 2^−8). The C statement x = (((x >> 16) + x) >> 14) + (x << 2); computes x′ = ⌊2^2 (2^0 + 2^−16 + 2^−32) x⌋ mod 2^32. Then ((x >> 8) + x) & 0x3ff is x′′ = ⌊(2^0 + 2^−8) x′⌋ mod 2^10.
I don’t have time right now to do the error analysis formally. Informally, the error interval of the first computation has width < 1; the second, width < 2 + 2^−8; the third, width < ((2 − 2^−8) + 1)/2^2 < 1, which allows correct rounding.
Regarding improvements, the 2^−40 term of the approximation seems not necessary (?), but we might as well have it unless we can drop the 2^−32 term. Dropping 2^−32 pushes the approximation quality out of spec.
Guess you're probably not looking for solutions that require fast 64-bit multiplication, but for the record:
return (x * 0x101010101010102ULL) >> 56;
This method (improved slightly since the previous edit) mashes up Warren and Keane. On my laptop, it’s faster than Keane but not as fast as a 64-bit multiply and shift. It avoids multiplication but benefits from a single rotate instruction. Unlike the original version, it’s probably OK on RISC-V.
Like Warren, this method approximates ⌊(256/255) x mod 256⌋ in 8.24 fixed point. Mod 256, each byte b contributes a term (256/255) b, which is approximately b.bbb base 256. The original version of this method just sums all four byte rotations. (I’ll get to the revised version in a moment.) This sum always underestimates the real value, but by less than 4 units in the last place. By adding 4/2^24 before truncating, we guarantee the right answer as in Keane.
The revised version saves work by relaxing the approximation quality. We write (256/255) x = (257/256) (65536/65535) x, evaluate (65536/65535) x in 16.16 fixed point (i.e., add x to its 16-bit rotation), and then multiply by 257/256 and mod by 256 into 8.24 fixed point. The first multiplication has error less than 2 units in the last place of 16.16, and the second is exact (!). The sum underestimates by less than (2/2^16) (257/256), so a constant term of 514/2^24 suffices to fix the truncation. It’s also possible to use a greater value in case a different immediate operand is more efficient.
uint32_t mod255(uint32_t x) {
    x += (x << 16) | (x >> 16);
    return ((x << 8) + x + 514) >> 24;
}
If we were to have a builtin, intrinsic, or method that is optimised to single instruction addc, one could use 32-bit arithmetic in the following way:
uint32_t carry = 0;
// sum up top and bottom 16 bits while generating carry out
x = __builtin_addc(x, x<<16, carry, &carry);
x &= 0xffff0000;
// store the previous carry to bit 0 while adding
// bits 16:23 over bits 24:31, and producing one more carry
x = __builtin_addc(x, x << 8, carry, &carry);
x = __builtin_addc(x, x >> 24, carry, &carry);
x &= 0x0000ffff; // actually 0x1ff is enough
// final correction for 0<=x<=257, i.e. min(x,x-255)
x = x < x-255 ? x : x - 255;
In Arm64 at least the regular add instruction can take the form of add r0, r1, r2 LSL 16; the masking with immediate or clearing consecutive bits is a single instruction bfi r0, wzr, #start_bit, #length.
For parallel calculation one can't use that efficiently widening multiplication. Instead one can divide-and-conquer while calculating carries -- starting with 16 uint32_t elements interpreted as 16+16 uint16_t elements, then moving to uint8_t arithmetic, one can calculate one result in slightly less than one instruction.
a0 = vld2q_u16(ptr); // split input to top16+bot16 bits
a1 = vld2q_u16(ptr + 8); // load more inputs
auto b0 = vaddq_u16(a0.val[0], a0.val[1]);
auto b1 = vaddq_u16(a1.val[0], a1.val[1]);
auto c0 = vcltq_u16(b0, a0.val[1]); // 8 carries
auto c1 = vcltq_u16(b1, a1.val[1]); // 8 more carries
b0 = vsubq_u16(b0, c0);
b1 = vsubq_u16(b1, c1);
auto d = vuzpq_u8(b0, b1);
auto result = vaddq_u8(d.val[0], d.val[1]);
auto carry = vcltq_u8(result, d.val[1]);
result = vsubq_u8(result, carry);
auto is_255 = vceqq_u8(result, vdupq_n_u8(255));
result = vbicq_u8(result, is_255);

simulate jg instruction (datalab's isGreater)

I am doing CSAPP's datalab, the isGreater function.
Here's the description
isGreater - if x > y then return 1, else return 0
Example: isGreater(4,5) = 0, isGreater(5,4) = 1
Legal ops: ! ~ & ^ | + << >>
Max ops: 24
Rating: 3
x and y are both int type.
So I considered simulating the jg instruction to implement it. Here's my code:
int isGreater(int x, int y)
{
    int yComplement = ~y + 1;
    int minusResult = x + yComplement;        // 0xffffffff
    int SF = (minusResult >> 31) & 0x1;       // 1
    int ZF = !minusResult;                    // 0
    int xSign = (x >> 31) & 0x1;              // 0
    int ySign = (yComplement >> 31) & 0x1;    // 1
    int OF = !(xSign ^ ySign) & (xSign ^ SF); // 0
    return !(OF ^ SF) & !ZF;
}
The jg instruction need SF == OF and ZF == 0.
But it can't pass a special case: x = 0x7fffffff (INT_MAX), y = 0x80000000 (INT_MIN).
I deduce it like this:
x + yComplement = 0xffffffff, so SF = 1, ZF = 0; since xSign != ySign, OF is set to 0.
So, what's wrong with my code? Is my OF-setting operation wrong?
You're detecting overflow in the addition x + yComplement, rather than in the overall subtraction.
-INT_MIN itself overflows in 2's complement; INT_MIN == -INT_MIN. This is the 2's complement anomaly¹.
You should be getting false-positive overflow detection for any negative number (other than INT_MIN) minus INT_MIN. The resulting addition will have signed overflow. e.g. -10 + INT_MIN overflows.
http://teaching.idallen.com/dat2343/10f/notes/040_overflow.txt has a table of input/output signs for addition and subtraction. The cases that overflow are where the input signs are opposite but the result sign matches y.
SUBTRACTION SIGN BITS (for num1 - num2 = sum)
         num1sign  num2sign  sumsign
         ---------------------------
            0         0         0
            0         0         1
            0         1         0
 *OVER*     0         1         1   (subtracting a negative is the same as adding a positive)
 *OVER*     1         0         0   (subtracting a positive is the same as adding a negative)
            1         0         1
            1         1         0
            1         1         1
You could use this directly with the original x and y, and only use yComplement as part of getting the minusResult. Adjust your logic to match this truth table.
Or you could use int ySign = (~y) >> 31; and leave the rest of your code unmodified. (Use a tmp to hold ~y so you only do the operation once, for this and yComplement). The one's complement inverse (~) does not suffer from the 2's complement anomaly.
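For concreteness, here is the question's function with only that one change applied (a sketch; the function name is mine):
int isGreater_fixed(int x, int y)
{
    int notY = ~y;                  // compute ~y once, reuse it below
    int yComplement = notY + 1;
    int minusResult = x + yComplement;
    int SF = (minusResult >> 31) & 0x1;
    int ZF = !minusResult;
    int xSign = (x >> 31) & 0x1;
    int ySign = (notY >> 31) & 0x1; // one's complement avoids the INT_MIN anomaly
    int OF = !(xSign ^ ySign) & (xSign ^ SF);
    return !(OF ^ SF) & !ZF;
}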
Footnote 1: sign/magnitude and one's complement have two redundant ways to represent 0, instead of a value with no inverse.
Fun fact: if you make an integer absolute-value function, you should consider the result unsigned to avoid this problem. int can't represent the absolute value of INT_MIN.
Efficiency improvements:
If you use unsigned int, you don't need & 1 after a shift because logical shifts don't sign-extend. (And as a bonus, it would avoid C signed-overflow undefined behaviour in +: http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html).
Then (if you used uint32_t, or sizeof(unsigned) * CHAR_BIT instead of 31) you'd have a safe and portable implementation of 2's complement comparison. (signed shift semantics for negative numbers are implementation-defined in C.) I think you're using C as a sort of pseudo-code for bit operations, and aren't interested in actually writing a portable implementation, and that's fine. The way you're doing things will work on normal compilers on normal CPUs.
Or you can use & 0x80000000 to leave the high bits in place (but then you'd have to left shift your ! result).
It's just the lab's restriction, you can't use unsigned or any constant larger than 0xff(255)
Ok, so you don't have access to logical right shift. Still, you need at most one &1. It's ok to work with numbers where all you care about is the low bit, but where the rest hold garbage.
You eventually do & !ZF, which is either &0 or &1. Thus, any high garbage in OF is wiped away.
You can also delay the >> 31 until after XORing together two numbers.
This is a fun problem that I want to optimize myself:
// untested, 13 operations
int isGreater_optimized(int x, int y)
{
    int not_y = ~y;
    int minus_y = not_y + 1;
    int sum = x + minus_y;
    int x_vs_y = x ^ y;     // high bit = 1 if they were opposite signs: OF is possible
    int x_vs_sum = x ^ sum; // high bit = 1 if they were opposite signs: OF is possible
    int OF = (x_vs_y & x_vs_sum) >> 31; // high bits hold garbage
    int SF = sum >> 31;
    int non_zero = !!sum;   // 0 or 1
    return (~(OF ^ SF)) & non_zero; // high garbage is nuked by `& 1`
}
Note the use of ~ instead of ! to invert a value that has high garbage.
It looks like there's still some redundancy in calculating OF separately from SF, but actually the XORing of sum twice doesn't cancel out. x ^ sum is an input for &, and we XOR with sum after that.
We can delay the shifts even later, though, and I found some more optimizations by avoiding an extra inversion. This is 11 operations
// replace 31 with sizeof(int) * CHAR_BIT if you want. #include <limits.h>
// or use int32_t
int isGreater_optimized2(int x, int y)
{
    int not_y = ~y;
    int minus_y = not_y + 1;
    int sum = x + minus_y;
    int SF = sum;           // value in the high bit, rest are garbage
    int x_vs_y = x ^ y;     // high bit = 1 if they were opposite signs: OF is possible
    int x_vs_sum = x ^ sum; // high bit = 1 if they were opposite signs: OF is possible
    int OF = x_vs_y & x_vs_sum; // low bits hold garbage
    int less = (OF ^ SF);
    int ZF = !sum;          // 0 or 1
    int le = (less >> 31) | ZF; // 'less' OR 'equal'; the shift smears the high bit over the garbage
    return !le;             // jg == jnle
}
I wondered if any compilers might see through this manual compare and optimize it into cmp edi, esi/ setg al, but no such luck :/ I guess that's not a pattern that they look for, because code that could have been written as x > y tends to be written that way :P
But anyway, here's the x86 asm output from gcc and clang on the Godbolt compiler explorer.
Assuming two's complement, INT_MIN's absolute value isn't representable as an int. So, yComplement == y (ie. still negative), and ySign is 1 instead of the desired 0.
You could instead calculate the sign of y like this (changing as little as possible in your code) :
int ySign = !((y >> 31) & 0x1);
For a more detailed analysis, and a more optimal alternative, check Peter Cordes' answer.

Divide a signed integer by a power of 2

I'm working on a way to divide a signed integer by a power of 2 using only binary operators (<< >> + ^ ~ & | !), and the result has to be rounded toward 0. I came across this question, also on Stack Overflow, about the problem; however, I cannot understand why it works. Here's the solution:
int divideByPowerOf2(int x, int n)
{
    return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}
I understand the x >> 31 part (only add the next part if x is negative, because if it's positive x will automatically be rounded toward 0). But what's bothering me is the (1 << n) + ~0 part. How can it work?
Assuming two's complement, just bit-shifting the dividend is equivalent to a certain kind of division: not the conventional division where we round the dividend to the next multiple of the divisor toward zero, but another kind where we round the dividend toward negative infinity. I rediscovered that in Smalltalk; see http://smallissimo.blogspot.fr/2015/03/is-bitshift-equivalent-to-division-in.html.
For example, let's divide -126 by 8. Traditionally, we would write
-126 = -15 * 8 - 6
But if we round toward negative infinity, we get a positive remainder and write it:
-126 = -16 * 8 + 2
The bit-shifting performs the second operation, in terms of bit patterns (assuming an 8-bit int for the sake of being short):
1000|0010 >> 3 = 1111|0000
1000|0010 = 1111|0000 * 0000|1000 + 0000|0010
So what if we want the traditional division with quotient rounded toward zero and remainder of same sign as dividend? Simple, we just have to add 1 to the quotient - if and only if the dividend is negative and the division is inexact.
You saw that x>>31 corresponds to first condition, dividend is negative, assuming int has 32 bits.
The second term corresponds to the second condition, if division is inexact.
See how -1, -2, -4, ... are encoded in two's complement: 1111|1111, 1111|1110, 1111|1100. So the negation of the nth power of two has n trailing zeros.
When the dividend has n trailing zeros and we divide by 2^n, then there is no need to add 1 to the final quotient. In any other case, we need to add 1.
What ((1 << n) + ~0) is doing is creating a mask with n trailing ones.
The n last bits don't really matter, because we are going to shift to the right and just throw them away. So, if the division is exact, the n trailing bits of dividend are zero, and we just add n 1s that will be skipped. On the contrary, if the division is inexact, then one or more of the n trailing bits of the dividend is 1, and we are sure to cause a carry to the n+1 bit position: that's how we add 1 to the quotient (we add 2^n to the dividend). Does that explain it a bit more?
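A quick numeric check of the explanation, reusing the -126 / 8 example (my test code; it assumes a 32-bit int and an arithmetic right shift, as the rest of the discussion does):
#include <stdio.h>

int main(void)
{
    int x = -126, n = 3;
    int mask = (1 << n) + ~0;               /* 0b111 = 7: n trailing ones */
    int q = (x + ((x >> 31) & mask)) >> n;  /* (-126 + 7) >> 3 = -119 >> 3 = -15 */
    printf("%d (expected %d)\n", q, x / 8); /* both print -15 */
    return 0;
}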
This is "write-only code": instead of trying to understand the code, try to create it by yourself.
For example, let's divide a number by 8 (shift right by 3).
If the number is negative, the normal right-shift rounds in the wrong direction. Let's "fix" it by adding a number:
int divideBy8(int x)
{
    if (x >= 0)
        return x >> 3;
    else
        return (x + whatever) >> 3;
}
Here you can come up with a mathematical formula for whatever, or do some trial and error. Anyway, here whatever = 7:
int divideBy8(int x)
{
    if (x >= 0)
        return x >> 3;
    else
        return (x + 7) >> 3;
}
How to unify the two cases? You need to make an expression that looks like this:
(x + stuff) >> 3
where stuff is 7 for negative x, and 0 for positive x. The trick here is using x >> 31, which is a 32-bit number whose bits are equal to the sign-bit of x: all 0 or all 1. So stuff is
(x >> 31) & 7
Combining all these, and replacing 8 and 7 by the more general power of 2, you get the code you asked about.
Note: in the description above, I assume that int represents a 32-bit hardware register, and hardware uses two's complement representation to do right shift.
OP's reference is to C# code, and there are many subtle differences that make it bad code in C, which is how this post is tagged.
int is not necessarily 32 bits, so using a magic number of 32 does not make for a robust solution.
In particular, (1 << n) + ~0 results in implementation-defined behavior when n causes a bit to be shifted into the sign place. Not good coding.
Restricting code to only using "binary" operators << >> + ^ ~ & | ! encourages a coder to assume things about int which are not portable nor compliant with the C spec. So OP's posted code does not "work" in general, although it may work in many common implementations.
OP's code fails when int is not 2's complement, does not use the range [-2147483648 .. 2147483647], or when 1 << n relies on implementation behavior that is not as expected.
// weak code
int divideByPowerOf2(int x, int n) {
    return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}
A simple alternative, assuming long long exceeds the range of int, follows. I doubt this meets some corner of OP's goals, but OP's given goals encourage non-robust coding.
int divideByPowerOf2(int x, int n) {
    long long ill = x;
    if (x < 0) ill = -ill;
    while (n--) ill >>= 1;
    if (x < 0) ill = -ill;
    return (int) ill;
}

How to multiply a given number by 2 without using arithmetic operators in c?

How to multiply a given number by 2 without using arithmetic operators in c language?
Use bit wise << operator:
x = x << 1;
This works for integer and long numbers (not floating point numbers).
It basically shifts the binary contents one position to the left, which is equivalent to multiplying by 2
Left shift.
But why would you want to do that? Leave that kind of optimization to the compiler.
List of operators and plenty of examples on wikipedia.
Just to extend on kgiannakakis' post:
The shift operator << works because it shifts at the binary level - effectively in base 2. Just as moving a number by one place to the left in decimal (base 10) is the same as multiplying by 10. For example:
23 (move to left one digit) = 230 = 23 * 10
For the example (using the prefix 0b to represent binary numbers):
3 * 2 = 6
0b11 << 1 = 0b110
Shifting by other numbers is equivalent to multiplying by 2 'n' times, or multiplying by 2 to the nth power (2^n)
3 * 8 = 3 * 2^3 = 24
0b11 * 8 = 0b11 << 3 = 0b11000 (note the trailing zeros)
And an example in decimal to finish it off:
23 (move to left 3 places) = 23 * 1000 = 23 * 10^3 = 23000
Caution: Shifting might not work for signed variables,
int x;
int tmp;
if (x < 0) {
    tmp = -x;
    tmp = tmp << 1;
    x = -tmp;
}
else {
    x = x << 1;
}
Just to make sure we are all clear: to do this, as everyone has suggested, use bitwise shift left (<<).
Every time you shift left, you are multiplying by a factor of two essentially.
int x;
int result;
x = 5;
// result = 10 (multiply by a factor of 2 or 2^1=2)
result = 5<<1;
//result = 20 (multiply by a factor of 4 or 2^2=4)
result = 5<<2;
//result = 40 (multiply by a factor of 8 or 2^3=8)
result = 5<<3;
