How can I compute a * b / c when both a and b are smaller than c, but a * b overflows?

Assuming that uint is the largest integral type on my fixed-point platform, I have:
uint func(uint a, uint b, uint c);
Which needs to return a good approximation of a * b / c.
The value of c is greater than both the value of a and the value of b.
So we know for sure that the value of a * b / c would fit in a uint.
However, the value of a * b itself overflows the size of a uint.
So one way to compute the value of a * b / c would be:
return a / c * b;
Or even:
if (a > b)
return a / c * b;
return b / c * a;
However, the value of c is greater than both the value of a and the value of b,
so a / c (and likewise b / c) is zero, and the suggestions above would simply return zero.
I need to reduce a * b and c proportionally, but again - the problem is that a * b overflows.
Ideally, I would be able to:
Replace a * b with uint(-1)
Replace c with uint(-1) / a / b * c.
But no matter how I order the expression uint(-1) / a / b * c, I encounter a problem:
uint(-1) / a / b * c is truncated to zero because of uint(-1) / a / b
uint(-1) / a * c / b overflows because of uint(-1) / a * c
uint(-1) * c / a / b overflows because of uint(-1) * c
How can I tackle this scenario in order to find a good approximation of a * b / c?
Edit 1
I do not have things such as _umul128 on my platform, where the largest integral type is uint64. My largest type is uint, and I have no support for anything larger than that (neither on the HW level, nor in some pre-existing standard library).
My largest type is uint.
Edit 2
In response to numerous duplicate suggestions and comments:
I do not have some "larger type" at hand, which I can use for solving this problem. That is why the opening statement of the question is:
Assuming that uint is the largest integral type on my fixed-point platform
I am assuming that no other type exists, neither on the SW layer (via some built-in standard library) nor on the HW layer.

needs to return a good approximation of a * b / c
My largest type is uint
both a and b are smaller than c
Variation on this 32-bit problem:
Algorithm: Scale a, b to not overflow
SQRT_MAX_P1 as a compile time constant of sqrt(uint_MAX + 1)
sh = 0;
if (c >= SQRT_MAX_P1) {
while (|a| >= SQRT_MAX_P1) a/=2, sh++
while (|b| >= SQRT_MAX_P1) b/=2, sh++
while (|c| >= SQRT_MAX_P1) c/=2, sh--
}
result = a*b/c
shift result by sh.
With an n-bit uint, I expect the result to be correct to at least about n/2 significant digits.
Could improve things by taking advantage of the smaller of a,b being less than SQRT_MAX_P1. More on that later if interested.
Example
#include <inttypes.h>
#define IMAX_BITS(m) ((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12))
// https://stackoverflow.com/a/4589384/2410359
#define UINTMAX_WIDTH (IMAX_BITS(UINTMAX_MAX))
#define SQRT_UINTMAX_P1 (((uintmax_t)1ull) << (UINTMAX_WIDTH/2))
uintmax_t muldiv_about(uintmax_t a, uintmax_t b, uintmax_t c) {
int shift = 0;
if (c > SQRT_UINTMAX_P1) {
while (a >= SQRT_UINTMAX_P1) {
a /= 2; shift++;
}
while (b >= SQRT_UINTMAX_P1) {
b /= 2; shift++;
}
while (c >= SQRT_UINTMAX_P1) {
c /= 2; shift--;
}
}
uintmax_t r = a * b / c;
if (shift > 0) r <<= shift;
if (shift < 0) r >>= -shift; // negate: shifting by a negative count is undefined
return r;
}
#include <stdio.h>
int main() {
uintmax_t a = 12345678;
uintmax_t b = 4235266395;
uintmax_t c = 4235266396;
uintmax_t r = muldiv_about(a,b,c);
printf("%ju\n", r);
}
Output with 32-bit math (Precise answer is 12345677)
12345600
Output with 64-bit math
12345677

Here is another approach that uses recursion and minimal approximation to achieve high precision.
First the code and below an explanation.
Code:
uint32_t bp(uint32_t a) {
uint32_t b = 0;
while (a!=0)
{
++b;
a >>= 1;
};
return b;
}
int mul_no_ovf(uint32_t a, uint32_t b)
{
return ((bp(a) + bp(b)) <= 32);
}
uint32_t f(uint32_t a, uint32_t b, uint32_t c)
{
if (mul_no_ovf(a, b))
{
return (a*b) / c;
}
uint32_t m = c / b;
++m;
uint32_t x = m*b - c;
// So m * b == c + x where x < b and m >= 2
uint32_t n = a/m;
uint32_t r = a % m;
// So a*b == n * (c + x) + r*b == n*c + n*x + r*b where r*b < c
// Approximation: get rid of the r*b part
uint32_t res = n;
if (r*b > c/2) ++res;
return res + f(n, x, c);
}
Explanation:
The multiplication a * b can be written as a sum of b
a * b = b + b + .... + b
Since b < c we can take a number m of these b so that (m-1)*b < c <= m*b, like
(b + b + ... + b) + (b + b + ... + b) + .... + (b + b + b)
\---------------/   \---------------/          \---------/
       m*b        +        m*b        + .... +     r*b
\------------------------------------/
            n times m*b
so we have
a*b = n*m*b + r*b
where r*b < c and m*b > c. Consequently, m*b is equal to c + x, so we have
a*b = n*(c + x) + r*b = n*c + n*x + r*b
Divide by c :
a*b/c = (n*c + n*x + r*b)/c = n + n*x/c + r*b/c
The values m, n, x, r can all be calculated from a, b and c without any loss of
precision using integer division (/) and remainder (%).
The approximation is to look at r*b (which is less than c) and "add zero" when r*b<=c/2
and "add one" when r*b>c/2.
So now there are two possibilities:
1) a*b/c = n + n*x/c
2) a*b/c = (n + 1) + n*x/c
So the problem (i.e. calculating a*b/c) has been changed to the form
MULDIV(a1,b1,c) = NUMBER + MULDIV(a2,b2,c)
where a2,b2 are less than a1,b1. Consequently, recursion can be used until
a2*b2 no longer overflows (and the calculation can be done directly).

I've established a solution which works in O(1) complexity (no loops):
typedef unsigned long long uint;
typedef struct
{
uint n;
uint d;
}
fraction;
uint func(uint a, uint b, uint c);
fraction reducedRatio(uint n, uint d, uint max);
fraction normalizedRatio(uint a, uint b, uint scale);
fraction accurateRatio(uint a, uint b, uint scale);
fraction toFraction(uint n, uint d);
uint roundDiv(uint n, uint d);
uint func(uint a, uint b, uint c)
{
uint hi = a > b ? a : b;
uint lo = a < b ? a : b;
fraction f = reducedRatio(hi, c, (uint)(-1) / lo);
return f.n * lo / f.d;
}
fraction reducedRatio(uint n, uint d, uint max)
{
fraction f = toFraction(n, d);
if (n > max || d > max)
f = normalizedRatio(n, d, max);
if (f.n != f.d)
return f;
return toFraction(1, 1);
}
fraction normalizedRatio(uint a, uint b, uint scale)
{
if (a <= b)
return accurateRatio(a, b, scale);
fraction f = accurateRatio(b, a, scale);
return toFraction(f.d, f.n);
}
fraction accurateRatio(uint a, uint b, uint scale)
{
uint maxVal = (uint)(-1) / scale;
if (a > maxVal)
{
uint c = a / (maxVal + 1) + 1;
a /= c; // we can now safely compute `a * scale`
b /= c;
}
if (a != b)
{
uint n = a * scale;
uint d = a + b; // can overflow
if (d >= a) // no overflow in `a + b`
{
uint x = roundDiv(n, d); // we can now safely compute `scale - x`
uint y = scale - x;
return toFraction(x, y);
}
if (n < b - (b - a) / 2)
{
return toFraction(0, scale); // `a * scale < (a + b) / 2 < (uint)(-1) < a + b`
}
return toFraction(1, scale - 1); // `(a + b) / 2 < a * scale < (uint)(-1) < a + b`
}
return toFraction(scale / 2, scale / 2); // allow reduction to `(1, 1)` in the calling function
}
fraction toFraction(uint n, uint d)
{
fraction f = {n, d};
return f;
}
uint roundDiv(uint n, uint d)
{
return n / d + n % d / (d - d / 2);
}
Here is my test:
#include <stdio.h>
int main()
{
uint a = (uint)(-1) / 3; // 0x5555555555555555
uint b = (uint)(-1) / 2; // 0x7fffffffffffffff
uint c = (uint)(-1) / 1; // 0xffffffffffffffff
printf("0x%llx", func(a, b, c)); // 0x2aaaaaaaaaaaaaaa
return 0;
}

You can cancel prime factors as follows:
uint gcd(uint a, uint b)
{
uint c;
while (b)
{
a %= b;
c = a;
a = b;
b = c;
}
return a;
}
uint func(uint a, uint b, uint c)
{
uint temp = gcd(a, c);
a = a/temp;
c = c/temp;
temp = gcd(b, c);
b = b/temp;
c = c/temp;
// Since you are sure the result will fit in the variable, you can simply
// return the expression you wanted after having those terms canceled.
return a * b / c;
}

Related

Efficient computation of the average of three unsigned integers (without overflow)

There is an existing question "Average of 3 long integers" that is specifically concerned with the efficient computation of the average of three signed integers.
The use of unsigned integers however allows for additional optimizations not applicable to the scenario covered in the previous question. This question is about the efficient computation of the average of three unsigned integers, where the average is rounded towards zero, i.e. in mathematical terms I want to compute ⌊ (a + b + c) / 3 ⌋.
A straightforward way to compute this average is
avg = a / 3 + b / 3 + c / 3 + (a % 3 + b % 3 + c % 3) / 3;
To first order, modern optimizing compilers will transform the divisions into multiplications with a reciprocal plus a shift, and the modulo operations into a back-multiply and a subtraction, where the back-multiply may use a scale_add idiom available on many architectures, e.g. lea on x86_64, add with lsl #n on ARM, iscadd on NVIDIA GPUs.
In trying to optimize the above in a generic fashion suitable for many common platforms, I observe that typically the cost of integer operations is in the relationship logical ≤ (add | sub) ≤ shift ≤ scale_add ≤ mul. Cost here refers to all of latency, throughput limitations, and power consumption. Any such differences become more pronounced when the integer type processed is wider than the native register width, e.g. when processing uint64_t data on a 32-bit processor.
My optimization strategy was therefore to minimize instruction count and replace "expensive" with "cheap" operations where possible, while not increasing register pressure and retaining exploitable parallelism for wide out-of-order processors.
The first observation is that we can reduce a sum of three operands into a sum of two operands by first applying a CSA (carry save adder) that produces a sum value and a carry value, where the carry value has twice the weight of the sum value. The cost of a software-based CSA is five logicals on most processors. Some processors, like NVIDIA GPUs, have a LOP3 instruction that can compute an arbitrary logical expression of three operands in one fell swoop, in which case CSA condenses to two LOP3s (note: I have yet to convince the CUDA compiler to emit those two LOP3s; it currently produces four LOP3s!).
The second observation is that because we are computing the modulo of division by 3, we don't need a back-multiply to compute it. We can instead use dividend % 3 = ((dividend / 3) + dividend) & 3, reducing the modulo to an add plus a logical since we already have the division result. This is an instance of the general algorithm: dividend % (2^n - 1) = ((dividend / (2^n - 1)) + dividend) & (2^n - 1).
Finally for the division by 3 in the correction term (a % 3 + b % 3 + c % 3) / 3 we don't need the code for generic division by 3. Since the dividend is very small, in [0, 6], we can simplify x / 3 into (3 * x) / 8 which requires just a scale_add plus a shift.
The code below shows my current work-in-progress. Using Compiler Explorer to check the code generated for various platforms shows the tight code I would expect (when compiled with -O3).
However, in timing the code on my Ivy Bridge x86_64 machine using the Intel 13.x compiler, a flaw became apparent: while my code improves latency (from 18 cycles to 15 cycles for uint64_t data) compared to the simple version, throughput worsens (from one result every 6.8 cycles to one result every 8.5 cycles for uint64_t data). Looking at the assembly code more closely it is quite apparent why that is: I basically managed to take the code down from roughly three-way parallelism to roughly two-way parallelism.
Is there a generically applicable optimization technique, beneficial on common processors in particular all flavors of x86 and ARM as well as GPUs, that preserves more parallelism? Alternatively, is there an optimization technique that further reduces overall operation count to make up for reduced parallelism? The computation of the correction term (tail in the code below) seems like a good target. The simplification (carry_mod_3 + sum_mod_3) / 2 looked enticing but delivers an incorrect result for one of the nine possible combinations.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define BENCHMARK (1)
#define SIMPLE_COMPUTATION (0)
#if BENCHMARK
#define T uint64_t
#else // !BENCHMARK
#define T uint8_t
#endif // BENCHMARK
T average_of_3 (T a, T b, T c)
{
T avg;
#if SIMPLE_COMPUTATION
avg = a / 3 + b / 3 + c / 3 + (a % 3 + b % 3 + c % 3) / 3;
#else // !SIMPLE_COMPUTATION
/* carry save adder */
T a_xor_b = a ^ b;
T sum = a_xor_b ^ c;
T carry = (a_xor_b & c) | (a & b);
/* here 2 * carry + sum = a + b + c */
T sum_div_3 = (sum / 3); // {MUL|MULHI}, SHR
T sum_mod_3 = (sum + sum_div_3) & 3; // ADD, AND
if (sizeof (size_t) == sizeof (T)) { // "native precision" (well, not always)
T two_carry_div_3 = (carry / 3) * 2; // MULHI, ANDN
T two_carry_mod_3 = (2 * carry + two_carry_div_3) & 6; // SCALE_ADD, AND
T head = two_carry_div_3 + sum_div_3; // ADD
T tail = (3 * (two_carry_mod_3 + sum_mod_3)) / 8; // ADD, SCALE_ADD, SHR
avg = head + tail; // ADD
} else {
T carry_div_3 = (carry / 3); // MUL, SHR
T carry_mod_3 = (carry + carry_div_3) & 3; // ADD, AND
T head = (2 * carry_div_3 + sum_div_3); // SCALE_ADD
T tail = (3 * (2 * carry_mod_3 + sum_mod_3)) / 8; // SCALE_ADD, SCALE_ADD, SHR
avg = head + tail; // ADD
}
#endif // SIMPLE_COMPUTATION
return avg;
}
#if !BENCHMARK
/* Test correctness on 8-bit data exhaustively. Should catch most errors */
int main (void)
{
T a, b, c, res, ref;
a = 0;
do {
b = 0;
do {
c = 0;
do {
res = average_of_3 (a, b, c);
ref = ((uint64_t)a + (uint64_t)b + (uint64_t)c) / 3;
if (res != ref) {
printf ("a=%08x b=%08x c=%08x res=%08x ref=%08x\n",
a, b, c, res, ref);
return EXIT_FAILURE;
}
c++;
} while (c);
b++;
} while (b);
a++;
} while (a);
return EXIT_SUCCESS;
}
#else // BENCHMARK
#include <math.h>
// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
#define N (3000000)
int main (void)
{
double start, stop, elapsed = INFINITY;
int i, k;
T a, b;
T avg0 = 0xffffffff, avg1 = 0xfffffffe;
T avg2 = 0xfffffffd, avg3 = 0xfffffffc;
T avg4 = 0xfffffffb, avg5 = 0xfffffffa;
T avg6 = 0xfffffff9, avg7 = 0xfffffff8;
T avg8 = 0xfffffff7, avg9 = 0xfffffff6;
T avg10 = 0xfffffff5, avg11 = 0xfffffff4;
T avg12 = 0xfffffff3, avg13 = 0xfffffff2;
T avg14 = 0xfffffff1, avg15 = 0xfffffff0;
a = 0x31415926;
b = 0x27182818;
avg0 = average_of_3 (a, b, avg0);
for (k = 0; k < 5; k++) {
start = second();
for (i = 0; i < N; i++) {
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
avg0 = average_of_3 (a, b, avg0);
b = (b + avg0) ^ a;
a = (a ^ b) + avg0;
}
stop = second();
elapsed = fmin (stop - start, elapsed);
}
printf ("a=%016llx b=%016llx avg=%016llx",
(uint64_t)a, (uint64_t)b, (uint64_t)avg0);
printf ("\rlatency: each average_of_3() took %.6e seconds\n",
elapsed / 16 / N);
a = 0x31415926;
b = 0x27182818;
avg0 = average_of_3 (a, b, avg0);
for (k = 0; k < 5; k++) {
start = second();
for (i = 0; i < N; i++) {
avg0 = average_of_3 (a, b, avg0);
avg1 = average_of_3 (a, b, avg1);
avg2 = average_of_3 (a, b, avg2);
avg3 = average_of_3 (a, b, avg3);
avg4 = average_of_3 (a, b, avg4);
avg5 = average_of_3 (a, b, avg5);
avg6 = average_of_3 (a, b, avg6);
avg7 = average_of_3 (a, b, avg7);
avg8 = average_of_3 (a, b, avg8);
avg9 = average_of_3 (a, b, avg9);
avg10 = average_of_3 (a, b, avg10);
avg11 = average_of_3 (a, b, avg11);
avg12 = average_of_3 (a, b, avg12);
avg13 = average_of_3 (a, b, avg13);
avg14 = average_of_3 (a, b, avg14);
avg15 = average_of_3 (a, b, avg15);
b = (b + avg0) ^ a;
a = (a ^ b) + avg0;
}
stop = second();
elapsed = fmin (stop - start, elapsed);
}
printf ("a=%016llx b=%016llx avg=%016llx", (uint64_t)a, (uint64_t)b,
(uint64_t)(avg0 + avg1 + avg2 + avg3 + avg4 + avg5 + avg6 + avg7 +
avg8 + avg9 +avg10 +avg11 +avg12 +avg13 +avg14 +avg15));
printf ("\rthroughput: each average_of_3() took %.6e seconds\n",
elapsed / 16 / N);
return EXIT_SUCCESS;
}
#endif // BENCHMARK
Let me throw my hat in the ring. Not doing anything too tricky here, I think.
#include <stdint.h>
uint64_t average_of_three(uint64_t a, uint64_t b, uint64_t c) {
uint64_t hi = (a >> 32) + (b >> 32) + (c >> 32);
uint64_t lo = hi + (a & 0xffffffff) + (b & 0xffffffff) + (c & 0xffffffff);
return 0x55555555 * hi + lo / 3;
}
Following discussion below about different splits, here's a version that saves a multiply at the expense of three bitwise-ANDs:
T hi = (a >> 2) + (b >> 2) + (c >> 2);
T lo = (a & 3) + (b & 3) + (c & 3);
avg = hi + (hi + lo) / 3;
I'm not sure if it fits your requirements, but maybe it works to just calculate the result and then fixup the error from the overflow:
T average_of_3 (T a, T b, T c)
{
T r = ((T) (a + b + c)) / 3;
T o = (a > (T) ~b) + ((T) (a + b) > (T) (~c));
if (o) r += ((T) 0x5555555555555555) << (o - 1);
T rem = ((T) (a + b + c)) % 3;
if (rem >= (3 - o)) ++r;
return r;
}
[EDIT] Here is the best branch-and-compare-less version I can come up with. On my machine, this version actually has slightly higher throughput than njuffa's code. __builtin_add_overflow(x, y, r) is supported by gcc and clang and returns 1 if the sum x + y overflows the type of *r and 0 otherwise, so the calculation of o is equivalent to the portable code in the first version, but at least gcc produces better code with the builtin.
T average_of_3 (T a, T b, T c)
{
T r = ((T) (a + b + c)) / 3;
T rem = ((T) (a + b + c)) % 3;
T dummy;
T o = __builtin_add_overflow(a, b, &dummy) + __builtin_add_overflow((T) (a + b), c, &dummy);
r += -((o - 1) & 0xaaaaaaaaaaaaaaab) ^ 0x5555555555555555;
r += (rem + o + 1) >> 2;
return r;
}
I answered the question you linked to already, so I am only answering the part that is different about this one: performance.
If you really cared about performance, then the answer is:
( a + b + c ) / 3
Since you cared about performance, you should have an intuition about the size of the data you are working with. You should not have worried about overflow on addition (multiplication is another matter) of only 3 values, because if your data is already big enough to use the high bits of your chosen data type, you are in danger of overflow anyway and should have used a larger integer type. If you are overflowing on uint64_t, then you should really ask yourself why exactly you need to count accurately up to 18 quintillion, and perhaps consider using float or double.
Now, having said all that, I will give you my actual reply: It doesn't matter. The question doesn't come up in real life and when it does, perf doesn't matter.
It could be a real performance question if you are doing it a million times in SIMD, because there, you are really incentivized to use integers of smaller width and you may need that last bit of headroom, but that wasn't your question.
New answer, new idea. This one's based on the mathematical identity
floor((a+b+c)/3) = floor(x + (a+b+c - 3x)/3)
When does this work with machine integers and unsigned division?
When the difference doesn't wrap, i.e. 0 ≤ a+b+c - 3x ≤ T_MAX.
This definition of x is fast and gets the job done.
T avg3(T a, T b, T c) {
T x = (a >> 2) + (b >> 2) + (c >> 2);
return x + (a + b + c - 3 * x) / 3;
}
Weirdly, ICC inserts an extra neg unless I do this:
T avg3(T a, T b, T c) {
T x = (a >> 2) + (b >> 2) + (c >> 2);
return x + (a + b + c - (x + x * 2)) / 3;
}
Note that T must be at least five bits wide.
If T is two platform words long, then you can save some double word operations by omitting the low word of x.
Alternative version with worse latency but maybe slightly higher throughput?
T lo = a + b;
T hi = lo < b;
lo += c;
hi += lo < c;
T x = (hi << (sizeof(T) * CHAR_BIT - 2)) + (lo >> 2);
avg = x + (T)(lo - 3 * x) / 3;
I suspect SIMPLE is defeating the throughput benchmark by CSEing and hoisting a/3+b/3 and a%3+b%3 out of the loop, reusing those results for all 16 avg0..15 results.
(The SIMPLE version can hoist much more of the work than the tricky version; really just a ^ b and a & b in that version.)
Forcing the function to not inline introduces more front-end overhead, but does make your version win, as we expect it should on a CPU with deep out-of-order execution buffers to overlap independent work. There's lots of ILP to find across iterations, for the throughput benchmark. (I didn't look closely at the asm for the non-inline version.)
https://godbolt.org/z/j95qn3 (using __attribute__((noinline)) with clang -O3 -march=skylake on Godbolt's SKX CPUs) shows 2.58 nanosec throughput for the simple way, 2.48 nanosec throughput for your way. vs. 1.17 nanosec throughput with inlining for the simple version.
-march=skylake allows mulx for more flexible full-multiply, but otherwise no benefit from BMI2. andn isn't used; the line you commented with mulhi / andn is mulx into RCX / and rcx, -2 which only requires a sign-extended immediate.
Another way to do this without forcing call/ret overhead would be inline asm like in Preventing compiler optimizations while benchmarking (Chandler Carruth's CppCon talk has some example of how he uses a couple wrappers), or Google Benchmark's benchmark::DoNotOptimize.
Specifically, GNU C asm("" : "+r"(a), "+r"(b)) between each avgX = average_of_3 (a, b, avgX); statement will make the compiler forget everything it knows about the values of a and b, while keeping them in registers.
My answer on I don't understand the definition of DoNotOptimizeAway goes into more detail about using a read-only "r" register constraint to force the compiler to materialize a result in a register, vs. "+r" to make it assume the value has been modified.
If you understand GNU C inline asm well, it may be easier to roll your own in ways that you know exactly what they do.
[Falk Hüffner points out in comments that this answer has similarities to his answer. Looking at his code more closely, belatedly, I do find some similarities. However, what I posted here is the product of an independent thought process, a continuation of my original idea "reduce three items to two prior to div-mod". I understood Hüffner's approach to be different: "naive computation followed by corrections".]
I have found a better way than the CSA-technique in my question to reduce the division and modulo work from three operands to two operands. First, form the full double-word sum, then apply the division and modulo by 3 to each of the halves separately, finally combine the results. Since the most significant half can only take the values 0, 1, or 2, computing the quotient and remainder of division by three is trivial. Also, the combination into the final result becomes simpler.
Compared to the non-simple code variant from the question this achieves speedup on all platforms I examined. The quality of the code generated by compilers for the simulated double-word addition varies but is satisfactory overall. Nonetheless it may be worthwhile to code this portion in a non-portable way, e.g. with inline assembly.
T average_of_3_hilo (T a, T b, T c)
{
const T fives = (((T)(~(T)0)) / 3); // 0x5555...
T avg, hi, lo, lo_div_3, lo_mod_3, hi_div_3, hi_mod_3;
/* compute the full sum a + b + c into the operand pair hi:lo */
lo = a + b;
hi = lo < a;
lo = c + lo;
hi = hi + (lo < c);
/* determine quotient and remainder of each half separately */
lo_div_3 = lo / 3;
lo_mod_3 = (lo + lo_div_3) & 3;
hi_div_3 = hi * fives;
hi_mod_3 = hi;
/* combine partial results into the division result for the full sum */
avg = lo_div_3 + hi_div_3 + ((lo_mod_3 + hi_mod_3 + 1) / 4);
return avg;
}
An experimental build of GCC-11 compiles the obvious naive function to something like:
uint32_t avg3t (uint32_t a, uint32_t b, uint32_t c) {
a += b;
b = a < b;
a += c;
b += a < c;
b = b + a;
b += b < a;
return (a - (b % 3)) * 0xaaaaaaab;
}
Which is similar to some of the other answers posted here.
Any explanation of how these solutions work would be welcome
(not sure of the netiquette here).

Modular exponentiation function generates incorrect result for big input in C

I tried two functions for modular exponentiation; for a big base they return wrong results.
One of the functions is:
uint64_t modular_exponentiation(uint64_t x, uint64_t y, uint64_t p)
{
uint64_t res = 1; // Initialize result
x = x % p; // Update x if it is more than or
// equal to p
while (y > 0)
{
// If y is odd, multiply x with result
if (y & 1)
res = (res*x) % p;
// y must be even now
y = y>>1; // y = y/2
x = (x*x) % p;
}
return res;
}
For input x = 1103362698, y = 137911680, p = 1217409241131113809,
it returns the value (x^y mod p): 749298230523009574 (incorrect).
The correct value is: 152166603192600961.
The other function I tried gave the same result. What is wrong with these functions?
The other one is:
long int exponentMod(long int A, long int B, long int C)
{
// Base cases
if (A == 0)
return 0;
if (B == 0)
return 1;
// If B is even
long int y;
if (B % 2 == 0) {
y = exponentMod(A, B / 2, C);
y = (y * y) % C;
}
// If B is odd
else {
y = A % C;
y = (y * exponentMod(A, B - 1, C) % C) % C;
}
return (long int)((y + C) % C);
}
With p = 1217409241131113809, this value as well as any intermediate values for res and x will be larger than 32 bits. This means that multiplying two of these numbers can produce a value larger than 64 bits, which overflows the datatype you're using.
If you restrict the parameters to 32-bit datatypes and use 64-bit datatypes for intermediate values, then the function will work. Otherwise you'll need to use a big-number library to get correct output.

FLT_EPSILON for an nth root finder with SSE/AVX

I'm trying to convert a function that finds the nth root in C for a double value from the following link
http://rosettacode.org/wiki/Nth_root#C
to find the nth root of 8 floats at once using AVX.
Part of that code uses DBL_EPSILON * 10. However, when I convert this to use float with AVX I have to use FLT_EPSILON*1000 or the code hangs and does not converge. When I print out FLT_EPSILON I see it is of order 1E-7, but this link, http://www.cplusplus.com/reference/cfloat/, says it should be 1E-5. When I print out DBL_EPSILON it's 1E-16, but the link says it should only be 1E-9. What's going on?
Here is the code so far (not fully optimized).
#include <stdio.h>
#include <float.h>
#include <immintrin.h> // AVX
inline double abs_(double x) {
return x >= 0 ? x : -x;
}
double pow_(double x, int e)
{
double ret = 1;
for (ret = 1; e; x *= x, e >>= 1) {
if ((e & 1)) ret *= x;
}
return ret;
}
double root(double a, int n)
{
double d, x = 1;
x = a/n;
if (!a) return 0;
//if (n < 1 || (a < 0 && !(n&1))) return 0./0.; /* NaN */
int cnt = 0;
do {
cnt++;
d = (a / pow_(x, n - 1) - x) / n;
x+= d;
} while (abs_(d) >= abs_(x) * (DBL_EPSILON * 10));
printf("%d\n", cnt);
return x;
}
__m256 pow_avx(__m256 x, int e) {
__m256 ret = _mm256_set1_ps(1.0f);
for (; e; x = _mm256_mul_ps(x,x), e >>= 1) {
if ((e & 1)) ret = _mm256_mul_ps(x,ret);
}
return ret;
}
inline __m256 abs_avx (__m256 x) {
return _mm256_max_ps(_mm256_sub_ps(_mm256_setzero_ps(), x), x);
//return x >= 0 ? x : -x;
}
int get_mask(const __m256 d, const __m256 x) {
__m256 ad = abs_avx(d);
__m256 ax = abs_avx(x);
__m256i mask = _mm256_castps_si256(_mm256_cmp_ps(ad, ax, _CMP_GT_OQ));
return _mm_movemask_epi8(_mm256_castsi256_si128(mask)) + _mm_movemask_epi8(_mm256_extractf128_si256(mask,1));
}
__m256 root_avx(__m256 a, int n) {
printf("%e\n", FLT_EPSILON);
printf("%e\n", DBL_EPSILON);
printf("%e\n", FLT_EPSILON*1000.0f);
__m256 d;
__m256 x = _mm256_set1_ps(1.0f);
//if (!a) return 0;
//if (n < 1 || (a < 0 && !(n&1))) return 0./0.; /* NaN */
__m256 in = _mm256_set1_ps(1.0f/n);
__m256 xtmp;
do {
d = _mm256_rcp_ps(pow_avx(x, n - 1));
d = _mm256_sub_ps(_mm256_mul_ps(a,d),x);
d = _mm256_mul_ps(d,in);
//d = (a / pow_avx(x, n - 1) - x) / n;
x = _mm256_add_ps(x, d); //x+= d;
xtmp =_mm256_mul_ps(x, _mm256_set1_ps(FLT_EPSILON*100.0f));
//} while (abs_(d) >= abs_(x) * (DBL_EPSILON * 10));
} while (get_mask(d, xtmp));
return x;
}
int main()
{
__m256 d = _mm256_set1_ps(16.0f);
__m256 out = root_avx(d, 4);
float result[8];
int i;
_mm256_storeu_ps(result, out);
for(i=0; i<8; i++) {
printf("%f\n", result[i]);
} printf("\n");
//double x = 16;
//printf("root(%g, 15) = %g\n", x, root(x, 4));
//double x = pow_(-3.14159, 15);
//printf("root(%g, 15) = %g\n", x, root(x, 15));
return 0;
}
_mm256_rcp_ps, which maps to the rcpps instruction, performs only an approximate reciprocal. The Intel 64 and IA-32 Architectures Software Developer's Manual says its relative error may be up to 1.5·2^-12. This is insufficient to cause the root finder to converge with accuracy 100*FLT_EPSILON.
You could use an exact division, such as:
d = pow_avx(x, n-1);
d = _mm256_sub_ps(_mm256_div_ps(a, d), x);
or add some refinement steps for the reciprocal estimate.
Incidentally, if your compiler supports using regular C operators with SIMD objects, consider using the regular C operators instead:
d = pow_avx(x, n-1);
d = a/d - x;
1e-5 is simply the maximum value the C standard allows an implementation to use for FLT_EPSILON. In practice, you'll be using IEEE-754 single-precision, which has an epsilon of 2^-23, which is approximately 1e-7.

The floating-point error

#include <stdio.h>
int main()
{
int n;
while ( scanf( "%d", &n ) != EOF ) {
double sum = 0,k;
if( n > 5000000 || n<=0 ) // check that n is in range
break;
for ( int i = 1; i <= n; i++ ) {
k = (double) 1 / i;
sum += k;
}
/*
for ( int i = n; i > 0; i-- ) {
k = 1 / (double)i;
sum += k;
}
*/
printf("%.12lf\n", sum);
}
return 0;
}
Why do I get a different answer from the two loops? Is there a floating-point error? When I input 5000000 the sum is 16.002164235299, but with the other for loop (the commented-out part) I get the sum 16.002164235300.
Because floating point math is not associative:
i.e. (a + b) + c is not necessarily equal to a + (b + c)
I also bumped into the a + b + c issue. Totally agreed with ArjunShankar.
// Here A != B in the general case
float A = ((a + b) + c);
float B = ((a + c) + b);
Most floating-point operations are performed with data loss in the mantissa, even when the operands fit well into it (numbers like 0.5 or 0.25).
In fact I was quite happy to find out the cause of a bug in my application, and I have written a short reminder article with a detailed explanation:
http://stepan.dyatkovskiy.com/2018/04/machine-fp-partial-invariance-issue.html
Below is the C example. Good luck!
example.c
#include <stdio.h>
// Helpers declaration, for implementation scroll down
float getAllOnes(unsigned bits);
unsigned getMantissaBits();
int main() {
// Determine mantissa size in bits
unsigned mantissaBits = getMantissaBits();
// Considering mantissa has only 3 bits, we would then get:
// a = 0b10 m=1, e=1
// b = 0b110 m=11, e=1
// c = 0b1000 m=1, e=3
// a + b = 0b1000, m=100, e=1
// a + c = 0b1010, truncated to 0b1000, m=100, e=1
// a + b + c result: 0b1000 + 0b1000 = 0b10000, m=100, e=2
// a + c + b result: 0b1000 + 0b110 = 0b1110, m=111, e=1
float a = 2,
b = getAllOnes(mantissaBits) - 1,
c = b + 1;
float ab = a + b;
float ac = a + c;
float abc = a + b + c;
float acb = a + c + b;
printf("\n"
"FP partial invariance issue demo:\n"
"\n"
"Mantissa size = %i bits\n"
"\n"
"a = %.1f\n"
"b = %.1f\n"
"c = %.1f\n"
"(a+b) result: %.1f\n"
"(a+c) result: %.1f\n"
"(a + b + c) result: %.1f\n"
"(a + c + b) result: %.1f\n"
"---------------------------------\n"
"diff(a + b + c, a + c + b) = %.1f\n\n",
mantissaBits,
a, b, c,
ab, ac,
abc, acb,
abc - acb);
return 1;
}
// Helpers
float getAllOnes(unsigned bits) {
return (unsigned)((1 << bits) - 1);
}
unsigned getMantissaBits() {
unsigned sz = 1;
unsigned unbeleivableHugeSize = 1024;
float allOnes = 1;
for (;sz != unbeleivableHugeSize &&
allOnes + 1 != allOnes;
allOnes = getAllOnes(++sz)
) {}
return sz-1;
}

Overflow-safe modular addition and subtraction in C?

I'm implementing an algorithm in C that needs to do modular addition and subtraction quickly on unsigned integers and can handle overflow conditions correctly. Here's what I have now (which does work):
/* a and/or b may be greater than m */
uint32_t modadd_32(uint32_t a, uint32_t b, uint32_t m) {
uint32_t tmp;
if (b <= UINT32_MAX - a)
return (a + b) % m;
if (m <= (UINT32_MAX>>1))
return ((a % m) + (b % m)) % m;
tmp = a + b;
if (tmp > (uint32_t)(m * 2)) // m*2 must be truncated before compare
tmp -= m;
tmp -= m;
return tmp % m;
}
/* a and/or b may be greater than m */
uint32_t modsub_32(uint32_t a, uint32_t b, uint32_t m) {
uint32_t tmp;
if (a >= b)
return (a - b) % m;
tmp = (m - ((b - a) % m)); /* results in m when 0 is needed */
if (tmp == m)
return 0;
return tmp;
}
Anybody know of a better algorithm? The libraries I've found that do modular arithmetic all seem to be for large arbitrary precision numbers which is way overkill.
Edit: I want this to run well on a 32-bit machine. Also, my existing functions are trivially converted to work with other sizes of unsigned integers, a property which would be nice to retain.
Modular operations usually assume that a and b are less than m. This allows simpler algorithms:
umod_t sub_mod(umod_t a, umod_t b, umod_t m)
{
if ( a>=b )
return a - b;
else
return m - b + a;
}
umod_t add_mod(umod_t a, umod_t b, umod_t m)
{
if ( 0==b ) return a;
// return sub_mod(a, m-b, m);
b = m - b;
if ( a>=b )
return a - b;
else
return m - b + a;
}
Source: Matters Computational, chapter 39.1.
I'd just do the arithmetic in uint32_t if it fits and in uint64_t otherwise.
uint32_t modadd_32(uint32_t a, uint32_t b, uint32_t m) {
if (b <= UINT32_MAX - a)
return (a + b) % m;
else
return ((uint64_t)a + b) % m;
}
On an architecture with 64-bit integer types, this should add almost no overhead; you could even think of just doing everything in uint64_t. On architectures where uint64_t is synthesized,
let the compiler decide what it thinks is best, and then look at the generated assembler and measure to see if this is satisfactory.
Overflow-safe modular addition
First establish that a < m and b < m with the usual % m.
Add the updated a and b.
Should a (or b) exceed the uintN_t sum, then the mathematical sum overflowed uintN_t, and a single subtraction of m will "mod" the mathematical sum back into the uintN_t range.
If the sum exceeds m, then as in the step above, a single subtraction of m will "mod" the sum.
uintN_t modadd_N(uintN_t a, uintN_t b, uintN_t m) {
// may omit these 2 steps if a < m and b < m are known before the call.
a %= m;
b %= m;
uintN_t sum = a + b;
if (sum >= m || sum < a) {
sum -= m;
}
return sum;
}
Quite simple in the end.
Overflow-safe modular subtraction
A variation on Evgeny Kluev's good answer.
uintN_t modsub_N(uintN_t a, uintN_t b, uintN_t m) {
// may omit these 2 steps if a < m and b < m are known before the call.
a %= m;
b %= m;
uintN_t diff = a - b;
if (a < b) {
diff += m;
}
return diff;
}
Note this approach works for various N such as 32, 64, 16 or unsigned, unsigned long, etc. without resorting to wider types. It also works for unsigned types narrower than int/unsigned.
