How to compute 2⁶⁴/n in C?

How to compute the integer division, 2⁶⁴/n? Assuming:
unsigned long is 64-bit
We use a 64-bit CPU
1 < n < 2⁶⁴
If we do 18446744073709551616ul / n, we get warning: integer constant is too large for its type at compile time. This is because 2⁶⁴ cannot be represented in a 64-bit integer. Another way is the following:
#define IS_POWER_OF_TWO(x) ((x & (x - 1)) == 0)

unsigned long q = 18446744073709551615ul / n;
if (IS_POWER_OF_TWO(n))
    return q + 1;
else
    return q;
Is there any faster (CPU cycle) or cleaner (coding) implementation?

I'll use uint64_t here (which needs the <stdint.h> include) so as not to require your assumption about the size of unsigned long.
phuclv's idea of using -n is clever, but can be made much simpler. As unsigned 64-bit integers, we have -n = 2⁶⁴-n, then (-n)/n = 2⁶⁴/n - 1, and we can simply add back the 1.
uint64_t divide_two_to_the_64(uint64_t n) {
    return (-n)/n + 1;
}
The generated code is just what you would expect (gcc 8.3 on x86-64 via godbolt):
mov rax, rdi
xor edx, edx
neg rax
div rdi
add rax, 1
ret
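A quick spot check (my own harness, not part of the original answer):

#include <stdint.h>
#include <stdio.h>

uint64_t divide_two_to_the_64(uint64_t n) {  // definition from above
    return (-n)/n + 1;
}

int main(void) {
    printf("%llu\n", (unsigned long long)divide_two_to_the_64(3));          // 6148914691236517205
    printf("%llu\n", (unsigned long long)divide_two_to_the_64(1ULL << 32)); // 4294967296
    return 0;
}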

I've come up with another solution which was inspired by this question. From there we know that
(a₁ + a₂ + … + aₘ)/n = (a₁/n + a₂/n + … + aₘ/n) + (a₁ % n + a₂ % n + … + aₘ % n)/n
(with / denoting integer division). By choosing a₁ = 2⁶⁴ − n and a₂ = n we'll have
2⁶⁴/n = ((2⁶⁴ − n) + n)/n
= [(2⁶⁴ − n)/n + n/n] + [(2⁶⁴ − n) % n + n % n]/n
= (2⁶⁴ − n)/n + (n + (2⁶⁴ − n) % n)/n
The last step folds n/n = 1 into the second fraction: n % n = 0 and 0 ≤ (2⁶⁴ − n) % n < n, so both forms contribute exactly 1. 2⁶⁴ − n is the 2's complement of n, which is -n, or we can also write it as ~0 - n + 1. So the final solution would be
uint64_t twoPow64div(uint64_t n)
{
    return (-n)/n + (n + (-n) % n)/n + (n > 1ULL << 63);
}
The last part corrects the result: because we deal with unsigned integers instead of signed ones like in the other question, when n > 2⁶³ the sum n + (-n) % n equals n + (2⁶⁴ − n) = 2⁶⁴ and wraps to 0, so the middle term loses the 1 it should contribute. I checked both the 32- and 64-bit versions on my PC and the results match your solution.
On MSVC, however, there's an intrinsic for 128-bit division, so you can use it like this:
uint64_t remainder;
return _udiv128(1, 0, n, &remainder);
which results in the cleanest output
mov edx, 1
xor eax, eax
div rcx
ret 0
Here's the demo
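For completeness, a self-contained wrapper around the intrinsic could look like this (a sketch; the wrapper name is mine, and _udiv128 requires MSVC targeting x64):

#include <stdint.h>
#include <intrin.h>

uint64_t twoPow64div_msvc(uint64_t n)
{
    uint64_t remainder;
    return _udiv128(1, 0, n, &remainder);  // dividend = 2^64 (high = 1, low = 0)
}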
On most x86 compilers (one notable exception is MSVC) long double also has 64 bits of significand, so you can use any of these:
(uint64_t)(powl(2, 64)/n)
(uint64_t)(((long double)~0ULL)/n)
(uint64_t)(18446744073709551616.0L/n)
although the performance would probably be worse. Two caveats apply: the ~0ULL variant computes (2⁶⁴ − 1)/n, which is one too small when n is a power of two, and because the long double quotient is rounded to nearest rather than truncated, the result can come out one too high for a few divisors (for example n = 274177, a factor of 2⁶⁴ + 1). This can also be applied to any implementation where long double has more than 63 bits of significand, like PowerPC with its double-double implementation.
There's a related question about calculating ((UINT_MAX + 1)/x)*x - 1 (Integer arithmetic: Add 1 to UINT_MAX and divide by n without overflow), which also has clever solutions. Based on that we have
2⁶⁴/n = (2⁶⁴ - n + n)/n = (2⁶⁴ - n)/n + 1 = (-n)/n + 1
which is essentially just another way to get Nate Eldredge's answer.
Here's a demo for other compilers on godbolt.
See also:
Trick to divide a constant (power of two) by an integer
Efficient computation of 2**64 / divisor via fast floating-point reciprocal

We use a 64-bit CPU
Which 64-bit CPU?
In general, if you multiply an N-bit number by an M-bit number, the result will have up to N+M bits. Integer division is similar: if an N-bit number is divided by an M-bit number, the result will have at most N-M+1 bits.
Because multiplication is naturally "widening" (the result has more digits than either of the source numbers) and integer division is naturally "narrowing" (the result has fewer digits), some CPUs support "widening multiplication" and "narrowing division".
In other words, some 64-bit CPUs support dividing a 128-bit number by a 64-bit number to get a 64-bit result. For example, on 80x86 it's a single DIV instruction.
Unfortunately, C doesn't support "widening multiplication" or "narrowing division". It only supports "result is same size as source operands".
Ironically (for unsigned 64-bit divisions on 64-bit 80x86) there is no other choice: the compiler must use the DIV instruction, which divides a 128-bit number by a 64-bit number. So the C language forces you to use a 64-bit numerator, the code generated by the compiler zero-extends that numerator to 128 bits and divides it by a 64-bit number to get a 64-bit result, and then you write extra code to work around the fact that the language prevented you from using a 128-bit numerator to begin with.
Hopefully you can see how this situation might be considered "less than ideal".
What I'd want is a way to trick the compiler into supporting "narrowing division". For example, maybe by abusing casts and hoping that the optimiser is smart enough, like this:
__uint128_t numerator = (__uint128_t)1 << 64;
if(n > 1) {
    return (uint64_t)(numerator/n);
}
I tested this for the latest versions of GCC, CLANG and ICC (using https://godbolt.org/ ) and found that (for 64-bit 80x86) none of the compilers are smart enough to realise that a single DIV instruction is all that is needed: they all generate a call to __udivti3, an expensive library function that computes a full 128-bit result. The compilers will only use DIV when the (128-bit) numerator is 64 bits (and it will be preceded by an XOR RDX,RDX to set the highest half of the 128-bit numerator to zero).
In other words, it's likely that the only way to get ideal code (the DIV instruction by itself on 64-bit 80x86) is to resort to inline assembly.
For example, the best code you'll get without inline assembly (from Nate Eldredge's answer) will be:
mov rax, rdi
xor edx, edx
neg rax
div rdi
add rax, 1
ret
...and the best code that's possible is:
mov edx, 1
xor rax, rax
div rdi
ret
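For reference, a minimal sketch of the inline-assembly route with GCC/Clang extended asm (x86-64 only; the function name is mine, and it relies on 1 < n so the quotient fits in 64 bits and DIV cannot fault):

#include <stdint.h>

uint64_t div_2_64(uint64_t n)
{
    uint64_t quotient, remainder;
    __asm__("div %[n]"                          /* unsigned RDX:RAX / n */
            : "=a"(quotient), "=d"(remainder)
            : [n] "r"(n), "a"(0ULL), "d"(1ULL)  /* RDX:RAX = 2^64 */
            : "cc");
    return quotient;
}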

Your way is pretty good. It might be better to write it like this:
return 18446744073709551615ul / n + ((n&(n-1)) ? 0:1);
The hope is to make sure the compiler notices that it can do a conditional move instead of a branch.
Compile and disassemble.

Related

How can I quickly get the value 2^64 divided by random integer in C lang? [duplicate]


Efficient modulo-255 computation

I am trying to find the most efficient way to compute modulo 255 of a 32-bit unsigned integer. My primary focus is to find an algorithm that works well across x86 and ARM platforms with an eye towards applicability beyond that. To first order, I am trying to avoid memory operations (which could be expensive), so I am looking for bit-twiddly approaches while avoiding tables. I am also trying to avoid potentially expensive operations such as branches and multiplies, and to minimize the number of operations and registers used.
The ISO-C99 code below captures the eight variants I tried so far. It includes a framework for exhaustive test. I bolted onto this some crude execution time measurement which seems to work well enough to get a first performance impression. On the few platforms I tried (all with fast integer multiplies) the variants WARREN_MUL_SHR_2, WARREN_MUL_SHR_1, and DIGIT_SUM_CARRY_OUT_1 seem to be the most performant. My experiments show that the x86, ARM, PowerPC and MIPS compilers I tried at Compiler Explorer all make very good use of platform-specific features such as three-input LEA, byte-expansion instructions, multiply-accumulate, and instruction predication.
The variant NAIVE_USING_DIV uses an integer division, a back-multiply with the divisor, followed by subtraction. This is the baseline case. Modern compilers know how to efficiently implement unsigned integer division by 255 (via multiplication) and will use a discrete replacement for the back-multiply where appropriate. To compute modulo base − 1, one can sum base digits, then fold the result. For example, 3334 mod 9: sum 3+3+3+4 = 13, fold 1+3 = 4. If the result after folding is base − 1, we need to generate 0 instead. DIGIT_SUM_THEN_FOLD uses this method.
A. Cockburn, "Efficient implementation of the OSI transport protocol checksum algorithm using 8/16-bit arithmetic", ACM SIGCOMM Computer Communication Review, Vol. 17, No. 3, July/Aug. 1987, pp. 13-20
showed a different way of adding digits modulo base − 1 efficiently in the context of a checksum computation modulo 255. Compute a byte-wise sum of the digits, and after each addition, add any carry-out from the addition as well. So this would be an ADD a, b; ADC a, 0 sequence. Writing out the addition chain for this using base 256 digits, it becomes clear that the computation is basically a multiply with 0x0101 ... 0101. The result will be in the most significant digit position, except that one needs to capture the carry-out from the addition in that position separately. This method only works when a base digit comprises 2ᵏ bits; here we have k = 3. I tried three different ways of remapping a result of base − 1 to 0, resulting in variants DIGIT_SUM_CARRY_OUT_1, DIGIT_SUM_CARRY_OUT_2, DIGIT_SUM_CARRY_OUT_3.
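Written out in portable C (my own illustration, not one of the benchmarked variants), the byte-wise digit sum with end-around carry looks like this:

#include <stdint.h>

uint32_t digit_sum_end_around(uint32_t x)
{
    uint32_t s = (x & 0xff) + ((x >> 8) & 0xff) + ((x >> 16) & 0xff) + (x >> 24);
    s = (s & 0xff) + (s >> 8);  // feed the carries back in
    s = (s & 0xff) + (s >> 8);  // at most one more carry can remain
    return s;  // in [0, 255]; a result of 255 still needs remapping to 0
}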
An intriguing approach to computing modulo-63 efficiently was demonstrated by Joe Keane in the newsgroup comp.lang.c on 1995/07/09. While thread participant Peter L. Montgomery proved the algorithm correct, unfortunately Mr. Keane did not respond to requests to explain its derivation. This algorithm is also reproduced in H. Warren's Hacker's Delight 2nd ed. I was able to extend it, in purely mechanical fashion, to modulo-127 and modulo-255. This is the (appropriately named) KEANE_MAGIC variant. Update: Since I originally posted this question, I have worked out that Keane's approach is basically a clever fixed-point implementation of the following: return (uint32_t)(fmod (x * 256.0 / 255.0 + 0.5, 256.0) * (255.0 / 256.0));. This makes it a close relative of the next variant.
Henry S. Warren, Hacker's Delight 2nd ed., p. 272 shows a "multiply-shift-right" algorithm, presumably devised by the author themself, that is based on the mathematical property that n mod (2ᵏ − 1) = ⌊2ᵏ/(2ᵏ − 1) · n⌋ mod 2ᵏ. Fixed-point computation is used to multiply with the factor 2ᵏ/(2ᵏ − 1). I constructed two variants of this that differ in how they handle the mapping of a preliminary result of base − 1 to 0. These are variants WARREN_MUL_SHR_1 and WARREN_MUL_SHR_2.
Are there algorithms for modulo-255 computation that are even more efficient than the three top contenders I have identified so far, in particular for platforms with slow integer multiplies? An efficient modification of Keane's multiplication-free algorithm for the summing of four base 256 digits would seem to be of particular interest in this context.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NAIVE_USING_DIV       (1)
#define DIGIT_SUM_THEN_FOLD   (2)
#define DIGIT_SUM_CARRY_OUT_1 (3)
#define DIGIT_SUM_CARRY_OUT_2 (4)
#define DIGIT_SUM_CARRY_OUT_3 (5)
#define KEANE_MAGIC           (6)  // Joe Keane, comp.lang.c, 1995/07/09
#define WARREN_MUL_SHR_1      (7)  // Hacker's Delight, 2nd ed., p. 272
#define WARREN_MUL_SHR_2      (8)  // Hacker's Delight, 2nd ed., p. 272

#define VARIANT (WARREN_MUL_SHR_2)

uint32_t mod255 (uint32_t x)
{
#if VARIANT == NAIVE_USING_DIV
    return x - 255 * (x / 255);
#elif VARIANT == DIGIT_SUM_THEN_FOLD
    x = (x & 0xffff) + (x >> 16);
    x = (x & 0xff) + (x >> 8);
    x = (x & 0xff) + (x >> 8) + 1;
    x = (x & 0xff) + (x >> 8) - 1;
    return x;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_1
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    if (t == 255) t = 0;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_2
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x) + 1;
    t = (t & 0xff) + (t >> 8) - 1;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_3
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    t = t & ((t - 255) >> 8);
    return t;
#elif VARIANT == KEANE_MAGIC
    x = (((x >> 16) + x) >> 14) + (x << 2);
    x = ((x >> 8) + x + 2) & 0x3ff;
    x = (x - (x >> 8)) >> 2;
    return x;
#elif VARIANT == WARREN_MUL_SHR_1
    x = (0x01010101 * x + (x >> 8)) >> 24;
    x = x & ((x - 255) >> 8);
    return x;
#elif VARIANT == WARREN_MUL_SHR_2
    x = (0x01010101 * x + (x >> 8)) >> 24;
    if (x == 255) x = 0;
    return x;
#else
#error unknown VARIANT
#endif
}

uint32_t ref_mod255 (uint32_t x)
{
    volatile uint32_t t = x;
    t = t % 255;
    return t;
}

// timing with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;
    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
    double start, stop;
    uint32_t res, ref, x = 0;
    printf ("Testing VARIANT = %d\n", VARIANT);
    start = second();
    do {
        res = mod255 (x);
        ref = ref_mod255 (x);
        if (res != ref) {
            printf ("error # %08x: res=%08x ref=%08x\n", x, res, ref);
            return EXIT_FAILURE;
        }
        x++;
    } while (x);
    stop = second();
    printf ("test passed\n");
    printf ("elapsed = %.6f seconds\n", stop - start);
    return EXIT_SUCCESS;
}
For arbitrary unsigned integers, x and n, evaluating the modulo expression x % n involves (conceptually, at least) three operations: division, multiplication and subtraction:
quotient = x / n;
product = quotient * n;
modulus = x - product;
However, when n is a power of 2 (n = 2ᵖ), the modulo can be determined much more rapidly, simply by masking out all but the lower p bits.
On most CPUs, addition, subtraction and bit-masking are very 'cheap' (rapid) operations, multiplication is more 'expensive' and division is very expensive – but note that most optimizing compilers will convert division by a compile-time constant into a multiplication (by a different constant) and a bit-shift (vide infra).
Thus, if we can convert our modulo 255 into a modulo 256, without too much overhead, we can likely speed up the process. We can do just this by noting that x % n is equivalent to (x + x / n) % (n + 1)†. Thus, our conceptual operations are now: division, addition and masking.
In the specific case of masking the lower 8 bits, x86/x64-based CPUs (and others?) will likely be able to perform a further optimization, as they can access 8-bit versions of (most) registers.
Here's what the clang-cl compiler generates for a naïve modulo 255 function (argument passed in ecx and returned in eax):
unsigned Naive255(unsigned x)
{
    return x % 255;
}
mov edx, ecx
mov eax, 2155905153 ;
imul rax, rdx ; Replacing the IDIV with IMUL and SHR
shr rax, 39 ;
mov edx, eax
shl edx, 8
sub eax, edx
add eax, ecx
And here's the (clearly faster) code generated using the 'trick' described above:
unsigned Trick255(unsigned x)
{
    return (x + x / 255) & 0xFF;
}
mov eax, ecx
mov edx, 2155905153
imul rdx, rax
shr rdx, 39
add edx, ecx
movzx eax, dl ; Faster than an explicit AND mask?
Testing this code on a Windows-10 (64-bit) platform (Intel® Core™ i7-8550U CPU) shows that it significantly (but not hugely) out-performs the other algorithms presented in the question.
† The answer given by David Eisenstat explains how/why this equivalence is valid.
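A quick exhaustive check of the dagger identity for n = 255 (my own harness; the unsigned wraparound of x + x / 255 is harmless here because 256 divides 2³²):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t x = 0;
    do {
        if (x % 255u != (x + x / 255u) % 256u) {
            printf("mismatch at %u\n", x);
            return 1;
        }
    } while (++x != 0u);
    printf("identity holds for all 32-bit x\n");
    return 0;
}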
Here’s my sense of how the fastest answers work. I don’t know yet whether Keane can be improved or easily generalized.
Given an integer x ≥ 0, let q = ⌊x/255⌋ (in C, q = x / 255;) and r = x − 255q (in C, r = x % 255;) so that q ≥ 0 and 0 ≤ r < 255 are integers and x = 255q + r.
Adrian Mole’s method
This method evaluates (x + ⌊x/255⌋) mod 2⁸ (in C, (x + x / 255) & 0xff), which equals (255q + r + q) mod 2⁸ = (2⁸q + r) mod 2⁸ = r.
Henry S. Warren’s method
Note that x + ⌊x/255⌋ = ⌊x + x/255⌋ = ⌊(2⁸/255) x⌋, where the first step follows from x being an integer. This method uses the multiplier 2⁰ + 2⁻⁸ + 2⁻¹⁶ + 2⁻²⁴ + 2⁻³² instead of 2⁸/255, which is the sum of the infinite series 2⁰ + 2⁻⁸ + 2⁻¹⁶ + 2⁻²⁴ + 2⁻³² + …. Since the approximation is slightly under, this method must detect the residue 2⁸ − 1 = 255.
Joe Keane’s method
The intuition for this method is to compute y = (2⁸/255) x mod 2⁸, which equals (2⁸q + (2⁸/255) r) mod 2⁸ = (2⁸/255) r, and return y − y/2⁸, which equals r.
Since these formulas don’t use the fact that ⌊(2⁸/255) r⌋ = r, Keane can switch from 2⁸ to 2¹⁰ for two guard bits. Ideally, these would always be zero, but due to fixed-point truncation and an approximation for 2¹⁰/255, they’re not. Keane adds 2 to switch from truncation to rounding, which also avoids the special case in Warren.
This method sort of uses the multiplier 2² (2⁰ + 2⁻⁸ + 2⁻¹⁶ + 2⁻²⁴ + 2⁻³² + 2⁻⁴⁰) = 2² (2⁰ + 2⁻¹⁶ + 2⁻³²) (2⁰ + 2⁻⁸). The C statement x = (((x >> 16) + x) >> 14) + (x << 2); computes x′ = ⌊2² (2⁰ + 2⁻¹⁶ + 2⁻³²) x⌋ mod 2³². Then ((x >> 8) + x) & 0x3ff is x′′ = ⌊(2⁰ + 2⁻⁸) x′⌋ mod 2¹⁰.
I don’t have time right now to do the error analysis formally. Informally, the error interval of the first computation has width < 1; the second, width < 2 + 2⁻⁸; the third, width < ((2 − 2⁻⁸) + 1)/2² < 1, which allows correct rounding.
Regarding improvements, the 2⁻⁴⁰ term of the approximation seems unnecessary (?), but we might as well keep it unless we can drop the 2⁻³² term. Dropping 2⁻³² pushes the approximation quality out of spec.
Guess you're probably not looking for solutions that require fast 64-bit multiplication, but for the record:
return (x * 0x101010101010102ULL) >> 56;
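Wrapped up as a compilable function (the wrapper name is mine):

#include <stdint.h>

uint32_t mod255_mul64(uint32_t x)
{
    // 0x0101010101010102 is ceil(2^64/255); the top 8 bits of the
    // low 64 bits of the product are x mod 255
    return (uint32_t)((x * 0x0101010101010102ULL) >> 56);
}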
This method (improved slightly since the previous edit) mashes up Warren and Keane. On my laptop, it’s faster than Keane but not as fast as a 64-bit multiply and shift. It avoids multiplication but benefits from a single rotate instruction. Unlike the original version, it’s probably OK on RISC-V.
Like Warren, this method approximates ⌊(256/255) x mod 256⌋ in 8.24 fixed point. Mod 256, each byte b contributes a term (256/255) b, which is approximately b.bbb base 256. The original version of this method just sums all four byte rotations. (I’ll get to the revised version in a moment.) This sum always underestimates the real value, but by less than 4 units in the last place. By adding 4·2⁻²⁴ before truncating, we guarantee the right answer as in Keane.
The revised version saves work by relaxing the approximation quality. We write (256/255) x = (257/256) (65536/65535) x, evaluate (65536/65535) x in 16.16 fixed point (i.e., add x to its 16-bit rotation), and then multiply by 257/256 and mod by 256 into 8.24 fixed point. The first multiplication has error less than 2 units in the last place of 16.16, and the second is exact (!). The sum underestimates by less than (2/2¹⁶) (257/256), so a constant term of 514/2²⁴ suffices to fix the truncation. It’s also possible to use a greater value in case a different immediate operand is more efficient.
uint32_t mod255(uint32_t x) {
    x += (x << 16) | (x >> 16);
    return ((x << 8) + x + 514) >> 24;
}
If we were to have a builtin, intrinsic, or method that is optimised to a single addc instruction, one could use 32-bit arithmetic in the following way:
uint32_t carry = 0;
// sum up top and bottom 16 bits while generating carry out
x = __builtin_addc(x, x << 16, carry, &carry);
x &= 0xffff0000;
// store the previous carry to bit 0 while adding
// bits 16:23 over bits 24:31, and producing one more carry
x = __builtin_addc(x, x << 8, carry, &carry);
x = __builtin_addc(x, x >> 24, carry, &carry);
x &= 0x0000ffff;  // actually 0x1ff is enough
// final correction for 0 <= x <= 257, i.e. min(x, x - 255)
x = x < x - 255 ? x : x - 255;
In Arm64, at least, the regular add instruction can take the form add r0, r1, r2, LSL 16; masking with an immediate, or clearing consecutive bits, is a single instruction: bfi r0, wzr, #start_bit, #length.
For parallel calculation one can't use widening multiplication efficiently. Instead, one can divide and conquer while calculating carries: starting with 16 uint32_t elements interpreted as 16+16 uint16_t elements, then moving to uint8_t arithmetic, one can calculate one result in slightly less than one instruction:
auto a0 = vld2q_u16(ptr);      // split each input into top16+bot16 bits
auto a1 = vld2q_u16(ptr + 8);  // load more inputs
auto b0 = vaddq_u16(a0.val[0], a0.val[1]);
auto b1 = vaddq_u16(a1.val[0], a1.val[1]);
auto c0 = vcltq_u16(b0, a0.val[1]);  // 8 carries
auto c1 = vcltq_u16(b1, a1.val[1]);  // 8 more carries
b0 = vsubq_u16(b0, c0);              // subtracting 0xffff == end-around carry
b1 = vsubq_u16(b1, c1);
auto d = vuzpq_u8(vreinterpretq_u8_u16(b0), vreinterpretq_u8_u16(b1));
auto result = vaddq_u8(d.val[0], d.val[1]);
auto carry = vcltq_u8(result, d.val[1]);
result = vsubq_u8(result, carry);
auto is_255 = vceqq_u8(result, vdupq_n_u8(255));
result = vbicq_u8(result, is_255);

Unsigned short int operation with Intel Intrinsics

I want to do some operations using Intel intrinsics (on a vector of unsigned 16-bit integers) and the operations are the following:
load or set from an array of unsigned short int.
Div and Mod operations with unsigned short int.
Multiplication operation with unsigned short int.
Store operation of unsigned short int into an array.
I looked into the Intrinsics guide but it looks like there are only intrinsics for signed short integers and not unsigned ones. Does anyone have a trick that could help me with this?
In fact I'm trying to store an image of a specific raster format in an array with a specific ordering. So I have to calculate the index where each pixel value is going to be stored:
unsigned int Index(unsigned int interleaving_depth, unsigned int x_size, unsigned int y_size, unsigned int z_size, unsigned int Pixel_number)
{
    unsigned int x = 0, y = 0, z = 0, reminder = 0, i = 0;
    y = Pixel_number/(x_size*z_size);
    reminder = Pixel_number % (x_size*z_size);
    i = reminder/(x_size*interleaving_depth);
    reminder = reminder % (x_size*interleaving_depth);
    if(i == z_size/interleaving_depth){
        x = reminder/(z_size - i*interleaving_depth);
        reminder = reminder % (z_size - i*interleaving_depth);
    }
    else
    {
        x = reminder/interleaving_depth;
        reminder = reminder % interleaving_depth;
    }
    z = interleaving_depth*i + reminder;
    if(z >= z_size)
        z = z_size - 1;
    return x + y*x_size + z*x_size*y_size;
}
If you only want the low half of the result, multiplication is the same binary operation for signed or unsigned. So you can use pmullw on either. There are separate high-half multiply instructions for signed and unsigned short, though: _mm_mulhi_epu16 (pmulhuw) vs. _mm_mulhi_epi16 (pmulhw).
Similarly, you don't need an _mm_set_epu16 because it's the same operation: on x86 casting to signed doesn't change the bit-pattern, so Intel only bothered to provide _mm_set_epi16, but you can use it with args like 0xFFFFu instead of -1 with no problems. (Using Intel intrinsics automatically means your code only has to be portable to x86 32 and 64 bit.)
Load / store intrinsics don't change the data at all.
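For instance, a short sketch of the load/multiply/store round trip for unsigned 16-bit elements (the function name is mine; each array is assumed to hold at least 8 elements):

#include <emmintrin.h>  // SSE2
#include <stdint.h>

void mul_u16_arrays(const uint16_t *a, const uint16_t *b, uint16_t *lo_out, uint16_t *hi_out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  // 8 x uint16_t
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i lo = _mm_mullo_epi16(va, vb);              // low 16 bits of each product
    __m128i hi = _mm_mulhi_epu16(va, vb);              // high 16 bits, unsigned
    _mm_storeu_si128((__m128i *)lo_out, lo);
    _mm_storeu_si128((__m128i *)hi_out, hi);
}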
SSE/AVX doesn't have integer division or mod instructions. If you have compile-time-constant divisors, do it yourself with a multiply/shift. You can look at compiler output to get the magic constant and shift counts (Why does GCC use multiplication by a strange number in implementing integer division?), or even let gcc auto-vectorize something for you. Or even use GNU C native vector syntax to divide:
#include <immintrin.h>

__m128i div13_epu16(__m128i a)
{
    typedef unsigned short __attribute__((vector_size(16))) v8uw;
    v8uw tmp = (v8uw)a;
    v8uw divisor = (v8uw)_mm_set1_epi16(13);
    v8uw result = tmp/divisor;
    return (__m128i)result;
    // clang allows "lax" vector type conversions without casts
    // gcc allows vector / scalar, e.g. tmp / 13.  Clang requires set1
    // to work with both, we need to jump through all the syntax hoops
}
compiles to this asm with gcc and clang (Godbolt compiler explorer):
div13_epu16:
    pmulhuw xmm0, XMMWORD PTR .LC0[rip]
    psrlw   xmm0, 2
    ret
.section .rodata
.LC0:
    .value 20165    # repeats 8 times
If you have runtime-variable divisors, it's going to be slower, but you can use http://libdivide.com/. It's not too bad if you reuse the same divisor repeatedly, so you only have to calculate a fixed-point inverse for it once; but code using an arbitrary inverse needs a variable shift count, which is less efficient with SSE (and for scalar integers as well), and potentially more instructions, because some divisors require a more complicated sequence than others.
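A minimal scalar sketch of the libdivide flow as I understand it (check the library's header for the exact API; the function name is mine):

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"

void divide_many(const uint32_t *vals, uint32_t *out, size_t count, uint32_t d)
{
    // compute the fixed-point inverse of d once, then reuse it per element
    struct libdivide_u32_t div = libdivide_u32_gen(d);
    for (size_t i = 0; i < count; i++)
        out[i] = libdivide_u32_do(vals[i], &div);
}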

Strange behavior of bit operation in C

I'm studying the C programming language and its bit operators.
I've written the code below and expected both results to be the same.
But in reality they are not.
#include <stdio.h>

#define N 0

int main() {
    int n = 0;
    printf("%d\n", ~0x00 + (0x01 << (0x20 + (~n + 1))));
    printf("%d\n", ~0x00 + (0x01 << (0x20 + (~N + 1))));
    return 0;
}
I assumed that the machine represents numbers as 2's complement on 32 bits.
Both should be -1 (all bits set to 1), but the first prints 0 and the second prints -1.
I think both are exactly the same code, except that one uses a variable and the other a constant.
I used gcc with the option -m32 on a Mac with an i5 CPU.
What's wrong with it?
Thanks.
The short answer
You're evaluating the same expression in two different ways—once at runtime on an x86, and once at compile time. (And I assume you've disabled optimizations when you compile, see below.)
The long answer
Looking at the disassembled executable I notice the following: the argument to the first printf() is computed at runtime:
movl $0x0,-0x10(%ebp)
mov -0x10(%ebp),%ecx ; ecx = 0 (int n)
mov $0x20,%edx ; edx = 32
sub %ecx,%edx ; edx = 32-0 = 32
mov %edx,%ecx ; ecx = 32
mov $0x1,%edx ; edx = 1
shl %cl,%edx ; edx = 1 << (32 & 31) = 1 << 0 = 1
add $0xffffffff,%edx ; edx = -1 + 1 = 0
The shift is performed by an x86 SHL instruction with %cl as the count operand. As per the Intel manual: "The destination operand can be a register or a memory location. The count operand can be an immediate value or register CL. The count is masked to five bits, which limits the count range to 0 to 31. A special opcode encoding is provided for a count of 1."
For the above code, that means that you're shifting by 0, thus leaving the 1 in place after the shift instruction.
In contrast, the argument to the second printf() is essentially a constant expression that is computed by the compiler, and the compiler does not mask the shift amount. It therefore performs a "correct" shift of a 32-bit value: 1<<32 = 0. It then adds -1 to that, and you see the 0 + (-1) = -1 as a result.
This also explains why you see only one warning: left shift count >= width of type and not two, as the warning stems from the compiler evaluating the shift of a 32-bit value by 32 bits. The compiler did not issue any warning regarding the runtime shift.
Reduced test case
The following is a reduction of your example to its essentials:
#define N 0
int n = 0;
printf("%d %d\n", 1<<(32-N) /* compiler */, 1<<(32-n) /* runtime */);
which prints 0 1, demonstrating the different results of the two shifts.
A word of caution
Note that the above example works only in -O0 compiled code, where you don't have the compiler optimize (evaluate and fold) constant expressions at compile time. If you take the reduced test case and compile it with -O3 then you get the same, and correct, results 0 0 from this optimized code:
movl $0x0,0x8(%esp)
movl $0x0,0x4(%esp)
I would think that if you change the compiler options for your test, you will see the same changed behavior.
Note: there seems to be a code-gen bug in gcc-4.2.1 (and others?) where the runtime result is just off (0 8027) due to a broken optimization.
A simplified example
unsigned n32 = 32;
printf("%d\n", (int) sizeof(int)); // 4
printf("%d\n", (0x01 << n32)); // 1
printf("%d\n", (0x01 << 32)); // 0
You get UB in (0x01 << n32) as the shift is >= the width of int. (It looks like only the 5 least-significant bits of n32 participated in the shift, hence a shift of 0.)
You get UB in (0x01 << 32) as the shift is >= the width of int. (It looks like the compiler performed the math with more bits.) This UB could have been the same as above.

How to calculate 2^n-1 efficiently without overflow?

I want to calculate 2ⁿ-1 for a 64-bit integer value.
What I currently do is this
for(i=0; i<n; i++) r|=1<<i;
and I wonder if there is a more elegant way to do it.
The line is in an inner loop, so I need it to be fast.
I thought of
r=(1ULL<<n)-1;
but it doesn't work for n=64, because << is only defined
for values of n up to 63.
EDIT:
Thanks for all your answers and comments.
Here is a little table with the solutions that I tried and liked best.
Second column is time in seconds of my (completely unscientific) benchmark.
r = N2MINUSONE_LUT[n];                    3.9   lookup table = fastest, answer by aviraldg
r = n ? ~0ull >> (64 - n) : 0ull;         5.9   fastest without LUT, comment by Christoph
r = (1ULL << n) - 1;                      5.9   obvious but WRONG!
r = (n == 64) ? -1 : (1ULL << n) - 1;     7.0   short, clear and quite fast, answer by Gabe
r = ((1ULL << (n/2)) << ((n+1)/2)) - 1;   8.2   nice, w/o special case, answer by drawnonward
r = (1ULL << n-1) + ((1ULL << n-1) - 1);  9.2   nice, w/o special case, answer by David Lively
r = pow(2, n) - 1;                       99.0   just for comparison
for(i=0; i<n; i++) r |= 1 << i;         123.7   my original solution = lame
I accepted
r =n?~0ull>>(64 - n):0ull;
as answer because it's in my opinion the most elegant solution.
It was Christoph who came up with it first, but unfortunately he only posted it in a comment. Jens Gustedt added a really nice rationale, so I accept his answer instead. Because I liked Aviral Dasgupta's lookup table solution, it got 50 reputation points via a bounty.
Use a lookup table. (Generated by your present code.) This is ideal, since the number of values is small, and you know the results already.
/* lookup table: n -> 2^n-1 -- do not touch */
const static uint64_t N2MINUSONE_LUT[] = {
0x0,
0x1,
0x3,
0x7,
0xf,
0x1f,
0x3f,
0x7f,
0xff,
0x1ff,
0x3ff,
0x7ff,
0xfff,
0x1fff,
0x3fff,
0x7fff,
0xffff,
0x1ffff,
0x3ffff,
0x7ffff,
0xfffff,
0x1fffff,
0x3fffff,
0x7fffff,
0xffffff,
0x1ffffff,
0x3ffffff,
0x7ffffff,
0xfffffff,
0x1fffffff,
0x3fffffff,
0x7fffffff,
0xffffffff,
0x1ffffffff,
0x3ffffffff,
0x7ffffffff,
0xfffffffff,
0x1fffffffff,
0x3fffffffff,
0x7fffffffff,
0xffffffffff,
0x1ffffffffff,
0x3ffffffffff,
0x7ffffffffff,
0xfffffffffff,
0x1fffffffffff,
0x3fffffffffff,
0x7fffffffffff,
0xffffffffffff,
0x1ffffffffffff,
0x3ffffffffffff,
0x7ffffffffffff,
0xfffffffffffff,
0x1fffffffffffff,
0x3fffffffffffff,
0x7fffffffffffff,
0xffffffffffffff,
0x1ffffffffffffff,
0x3ffffffffffffff,
0x7ffffffffffffff,
0xfffffffffffffff,
0x1fffffffffffffff,
0x3fffffffffffffff,
0x7fffffffffffffff,
0xffffffffffffffff,
};
How about a simple r = (n == 64) ? -1 : (1ULL<<n)-1;?
If you want to get the max value just before overflow with a given number of bits, try
r=(1ULL << n-1)+((1ULL<<n-1)-1);
By splitting the shift into two parts (in this case, two shifts of n-1 = 63 bits when n = 64, since 2⁶⁴ = 2·2⁶³), subtracting 1 and then adding the two results together, you should be able to do the calculation without overflowing the 64-bit data type.
if (n > 64 || n < 0)
    return undefined...
if (n == 64)
    return 0xFFFFFFFFFFFFFFFFULL;
return (1ULL << n) - 1;
I like aviraldg's answer best.
Just to get rid of the `ULL' stuff etc. in C99 I would do:
static inline uint64_t n2minusone(unsigned n) {
    return n ? (~(uint64_t)0) >> (64u - n) : 0;
}
To see that this is valid:
a uint64_t is guaranteed to have a width of exactly 64 bits
the bit negation of that `zero of type uint64_t' thus has exactly 64 one bits
right shift of an unsigned value is guaranteed to be a logical shift, so everything is filled with zeros from the left
a shift by a value equal to or greater than the width is undefined, so yes, you have to do at least one conditional to be sure of your result
an inline function (or alternatively a cast to uint64_t if you prefer) makes this type safe; an unsigned long long may well be a 128-bit wide value in the future
a static inline function should be seamlessly inlined in the caller without any overhead
The only problem is that your expression isn't defined for n=64? Then special-case that one value.
(n == 64 ? 0ULL : (1ULL << n)) - 1ULL
Mathematically, 1 << 64 wraps a 64-bit integer around to 0 (the shift itself is undefined in C), so there's no need to compute anything for n > 63; the shift should be fast enough:
r = n < 64 ? (1ULL << n) - 1 : 0;
But if you are trying this way to get the max value an N-bit unsigned integer can hold, you change the 0 into the known value, treating n == 64 as a special case (and you are not able to give a result for n > 64 on hardware with 64-bit integers unless you use a multiprecision/bignumber library).
Another approach with bit tricks
~-(1ULL << (n-1) ) | (1ULL << (n-1))
check if it can be simplified... of course, n > 0 is required
EDIT
Tests I've done
#include <stdio.h>

__attribute__((regparm(0))) unsigned int calcn(int n)
{
    register unsigned int res;
    asm(
        "  cmpl $32, %%eax\n"
        "  jg mmno\n"
        "  movl $1, %%ebx\n"    // ebx = 1
        "  subl $1, %%eax\n"    // eax = n - 1
        "  movb %%al, %%cl\n"   // because of only possible shll reg mode
        "  shll %%cl, %%ebx\n"  // ebx = ebx << eax
        "  movl %%ebx, %%eax\n" // eax = ebx
        "  negl %%ebx\n"        // -ebx
        "  notl %%ebx\n"        // ~-ebx
        "  orl %%ebx, %%eax\n"  // ~-ebx | ebx
        "  jmp mmyes\n"
        "mmno:\n"
        "  xor %%eax, %%eax\n"
        "mmyes:\n"
        :
        "=eax" (res):
        "eax" (n):
        "ebx", "ecx", "cc"
    );
    return res;
}

#define BMASK(X) (~-(1ULL << ((X)-1)) | (1ULL << ((X)-1)))

int main()
{
    int n = 32; //...
    printf("%08X\n", BMASK(n));
    printf("%08X %d %08X\n", calcn(n), n&31, BMASK(n&31));
    return 0;
}
Output with n = 32 is -1 and -1, while n = 52 yields "-1" and 0xFFFFF; coincidentally 52&31 = 20, and of course n = 20 gives 0xFFFFF...
EDIT2: now the asm code produces 0 for n > 32 (since I am on a 32-bit machine), but at this point the a ? b : 0 solution with the BMASK is clearer, and I doubt the asm solution is much faster (if speed is such a big concern, the table idea could be the faster one).
Since you've asked for an elegant way to do it:
const uint64_t MAX_UINT64 = 0xffffffffffffffffULL;
#define N2MINUSONE(n) ((MAX_UINT64>>(64-(n))))
I hate it that (a) n << 64 is undefined and (b) on the popular Intel hardware shifting by word size is a no-op.
You have three ways to go here:
Lookup table. I recommend against this because of the memory traffic, plus you will write a lot of code to maintain the memory traffic.
Conditional branch. Check if n is equal to the word size (8 * sizeof(unsigned long long)), if so, return ~(unsigned long long)0, otherwise shift and subtract as usual.
Try to get clever with arithmetic. For example, in real numbers 2^n = 2^(n-1) + 2^(n-1), and you can exploit this identity to make sure you never use a power equal to the word size. But you had better be very sure that n is never zero, because if it is, this identity cannot be expressed in the integers, and shifting left by -1 is likely to bite you in the ass.
I personally would go with the conditional branch—it is the hardest to screw up, manifestly handles all reasonable cases of n, and with modern hardware the likelihood of a branch misprediction is small. Here's what I do in my real code:
/* What makes things hellish is that C does not define the effects of
   a 64-bit shift on a 64-bit value, and the Intel hardware computes
   shifts mod 64, so that a 64-bit shift has the same effect as a
   0-bit shift. The obvious workaround is to define new shift functions
   that can shift by 64 bits. */
static inline uint64_t shl(uint64_t word, unsigned bits) {
    assert(bits <= 64);
    if (bits == 64)
        return 0;
    else
        return word << bits;
}
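A natural companion for the 2^n-1 problem, built on shl() above (my addition, not from the original answer):

static inline uint64_t mask64(unsigned bits) {
    return shl(1, bits) - 1;  // all-ones mask of 'bits' low bits, 0 <= bits <= 64
}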
I think the issue you are seeing is caused because (1<<n)-1 is evaluated as (1<<(n%64))-1 on some chips. Especially if n is or can be optimized as a constant.
Given that, there are many minor variations you can do. For example:
((1ULL<<(n/2))<<((n+1)/2))-1;
You will have to measure to see if that is faster than special-casing 64:
(n<64)?(1ULL<<n)-1:~0ULL;
It is true that in C each bit-shifting operation has to shift by fewer bits than there are bits in the operand (otherwise, the behavior is undefined). However, nobody prohibits you from doing the shift in two consecutive steps:
r = ((1ULL << (n - 1)) << 1) - 1;
I.e. shift by n - 1 bits first and then make an extra 1-bit shift. In this case, of course, you have to handle the n == 0 situation in a special way, if that is a valid input in your case.
In any case, it is better than your for cycle. The latter is basically the same idea, but taken to the extreme for some reason.
Ub = universe in bits = lg(U):
high(v) = v >> (Ub / 2)
low(v) = v & ((~0) >> (Ub - Ub / 2)) // Deal with overflow and with Ub even or odd
You can split the exponent into two shifts, each with a count guaranteed to stay in the range [0, (sizeof(uintmax_t) * CHAR_BIT) - 1], to create a universal pow2i function for integers of the largest supported native word size; this can easily be tweaked to support arbitrary word sizes.
I honestly don't get why this isn't just the implementation in hardware for bit shift overflows.
#include <limits.h>
#include <stdint.h>

static inline uintmax_t pow2i(uintmax_t exponent) {
    /* each count is at most half the word width, so both shifts are
       well defined; pow2i(CHAR_BIT * sizeof(uintmax_t)) wraps to 0 */
    return ((uintmax_t) 1) << (exponent / 2) << (exponent - exponent / 2);
}
From there, you can calculate pow2i(n) - 1.
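A small usage sketch (mine), relying on pow2i(64) wrapping to 0:

#include <stdio.h>

int main(void)
{
    printf("%jx\n", pow2i(8) - 1);   // ff
    printf("%jx\n", pow2i(64) - 1);  // ffffffffffffffff
    return 0;
}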
