Modulo multiplication (in C)

Modulo multiplication (in C) - c

Specifically: I have two unsigned integers (a,b) and I want to calculate (a*b)%UINT_MAX (UINT_MAX is defined as the maximal unsigned int). What is the best way to do so?
Background: I need to write a module for linux that will emulate a geometric sequence, reading from it will give me the next element (modulo UINT_MAX), the only solution I found is to add the current element to itself times, while adding is done using the following logic:(that I use for the arithmetic sequence)
for(int i=0; i<b; ++i){
if(UINT_MAX - current_value > difference) {
current_value += difference;
} else {
current_value = difference - (UINT_MAX - current_value);
}
when current_value = a in the first iteration (and is updated in every iteration, and difference = a (always).
Obviously this is not a intelligent solution.
How would an intelligent person achieve this?
Thanks!

As has been mentioned, if you have a type of twice the width available, just use that, here
(unsigned int)(((unsigned long long)a * b) % UINT_MAX)
if int is 32 bits and long long 64 (or more). If you have no larger type, you can split the factors at half the bit-width, multiply and reduce the parts, finally assemble it. Illustrated for 32-bit unsigned here:
a_low = a & 0xFFFF; // low 16 bits of a
a_high = a >> 16; // high 16 bits of a, shifted in low half
b_low = b & 0xFFFF;
b_high = b >> 16;
/*
* Now a = (a_high * 65536 + a_low), b = (b_high * 65536 + b_low)
* Thus a*b = (a_high * b_high) * 65536 * 65536
* + (a_high * b_low + a_low * b_high) * 65536
* + a_low * b_low
*
* All products a_i * b_j are at most (65536 - 1) * (65536 - 1) = UINT_MAX - 2 * 65536 + 2
* The high product reduces to
* (a_high * b_high) * (UINT_MAX + 1) = (a_high * b_high)
* The middle products are a bit trickier, but splitting again solves:
* m1 = a_high * b_low;
* m1_low = m1 & 0xFFFF;
* m1_high = m1 >> 16;
* Then m1 * 65536 = m1_high * (UINT_MAX + 1) + m1_low * 65536 = m1_high + m1_low * 65536
* Similar for a_low * b_high
* Finally, add the parts and take care of overflow
*/
m1 = a_high * b_low;
m2 = a_low * b_high;
m1_low = m1 & 0xFFFF;
m1_high = m1 >> 16;
m2_low = m2 & 0xFFFF;
m2_high = m2 >> 16;
result = a_high * b_high;
temp = result + ((m1_low << 16) | m1_high);
if (temp < result) // overflow
{
result = temp+1;
}
else
{
result = temp;
}
if (result == UINT_MAX)
{
result = 0;
}
// I'm too lazy to type out the rest, you get the gist, I suppose.
Of course, if what you need is actually reduction modulo UINT_MAX + 1, as #Toad assumes,then that's just what multiplication of unsigned int does.

EDIT: As pointed out in the comments... this answer applies to Modulo MAX_INT+1
I'll leave it standing here, for future reference.
It's much simpeler than that:
Just multiply the two unsigned ints, the result will also be an unsigned int.
Everything which didn't fit in the unsigned int is basically not there.
So no need to do a modulo operation:
See example here
#include <stdio.h>
void main()
{
unsigned int a,b;
a = 0x90000000;
b = 2;
unsigned int c = a*b;
printf("Answer is %X\r\n", c);
}
answer is: 0x20000000 (so it should have been 0x120000000, but the answer has been truncated, just what you wanted with the modulo operation)

Related

Abort on multiplication overflow [duplicate]

I am looking for an efficient (optionally standard, elegant and easy to implement) solution to multiply relatively large numbers, and store the result into one or several integers :
Let say I have two 64 bits integers declared like this :
uint64_t a = xxx, b = yyy;
When I do a * b, how can I detect if the operation results in an overflow and in this case store the carry somewhere?
Please note that I don't want to use any large-number library since I have constraints on the way I store the numbers.

1. Detecting the overflow:
x = a * b;
if (a != 0 && x / a != b) {
// overflow handling
}
Edit: Fixed division by 0 (thanks Mark!)
2. Computing the carry is quite involved. One approach is to split both operands into half-words, then apply long multiplication to the half-words:
uint64_t hi(uint64_t x) {
return x >> 32;
}
uint64_t lo(uint64_t x) {
return ((1ULL << 32) - 1) & x;
}
void multiply(uint64_t a, uint64_t b) {
// actually uint32_t would do, but the casting is annoying
uint64_t s0, s1, s2, s3;
uint64_t x = lo(a) * lo(b);
s0 = lo(x);
x = hi(a) * lo(b) + hi(x);
s1 = lo(x);
s2 = hi(x);
x = s1 + lo(a) * hi(b);
s1 = lo(x);
x = s2 + hi(a) * hi(b) + hi(x);
s2 = lo(x);
s3 = hi(x);
uint64_t result = s1 << 32 | s0;
uint64_t carry = s3 << 32 | s2;
}
To see that none of the partial sums themselves can overflow, we consider the worst case:
x = s2 + hi(a) * hi(b) + hi(x)
Let B = 1 << 32. We then have
x <= (B - 1) + (B - 1)(B - 1) + (B - 1)
<= B*B - 1
< B*B
I believe this will work - at least it handles Sjlver's test case. Aside from that, it is untested (and might not even compile, as I don't have a C++ compiler at hand anymore).

The idea is to use following fact which is true for integral operation:
a*b > c if and only if a > c/b
/ is integral division here.
The pseudocode to check against overflow for positive numbers follows:
if (a > max_int64 / b) then "overflow" else "ok".
To handle zeroes and negative numbers you should add more checks.
C code for non-negative a and b follows:
if (b > 0 && a > 18446744073709551615 / b) {
// overflow handling
}; else {
c = a * b;
}
Note, max value for 64 type:
18446744073709551615 == (1<<64)-1
To calculate carry we can use approach to split number into two 32-digits and multiply them as we do this on the paper. We need to split numbers to avoid overflow.
Code follows:
// split input numbers into 32-bit digits
uint64_t a0 = a & ((1LL<<32)-1);
uint64_t a1 = a >> 32;
uint64_t b0 = b & ((1LL<<32)-1);
uint64_t b1 = b >> 32;
// The following 3 lines of code is to calculate the carry of d1
// (d1 - 32-bit second digit of result, and it can be calculated as d1=d11+d12),
// but to avoid overflow.
// Actually rewriting the following 2 lines:
// uint64_t d1 = (a0 * b0 >> 32) + a1 * b0 + a0 * b1;
// uint64_t c1 = d1 >> 32;
uint64_t d11 = a1 * b0 + (a0 * b0 >> 32);
uint64_t d12 = a0 * b1;
uint64_t c1 = (d11 > 18446744073709551615 - d12) ? 1 : 0;
uint64_t d2 = a1 * b1 + c1;
uint64_t carry = d2; // needed carry stored here

Although there have been several other answers to this question, I several of them have code that is completely untested, and thus far no one has adequately compared the different possible options.
For that reason, I wrote and tested several possible implementations (the last one is based on this code from OpenBSD, discussed on Reddit here). Here's the code:
/* Multiply with overflow checking, emulating clang's builtin function
*
* __builtin_umull_overflow
*
* This code benchmarks five possible schemes for doing so.
*/
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>
#ifndef BOOL
#define BOOL int
#endif
// Option 1, check for overflow a wider type
// - Often fastest and the least code, especially on modern compilers
// - When long is a 64-bit int, requires compiler support for 128-bits
// ints (requires GCC >= 3.0 or Clang)
#if LONG_BIT > 32
typedef __uint128_t long_overflow_t ;
#else
typedef uint64_t long_overflow_t;
#endif
BOOL
umull_overflow1(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
long_overflow_t prod = (long_overflow_t)lhs * (long_overflow_t)rhs;
*result = (unsigned long) prod;
return (prod >> LONG_BIT) != 0;
}
// Option 2, perform long multiplication using a smaller type
// - Sometimes the fastest (e.g., when mulitply on longs is a library
// call).
// - Performs at most three multiplies, and sometimes only performs one.
// - Highly portable code; works no matter how many bits unsigned long is
BOOL
umull_overflow2(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long HALFSIZE_MAX = (1ul << LONG_BIT/2) - 1ul;
unsigned long lhs_high = lhs >> LONG_BIT/2;
unsigned long lhs_low = lhs & HALFSIZE_MAX;
unsigned long rhs_high = rhs >> LONG_BIT/2;
unsigned long rhs_low = rhs & HALFSIZE_MAX;
unsigned long bot_bits = lhs_low * rhs_low;
if (!(lhs_high || rhs_high)) {
*result = bot_bits;
return 0;
}
BOOL overflowed = lhs_high && rhs_high;
unsigned long mid_bits1 = lhs_low * rhs_high;
unsigned long mid_bits2 = lhs_high * rhs_low;
*result = bot_bits + ((mid_bits1+mid_bits2) << LONG_BIT/2);
return overflowed || *result < bot_bits
|| (mid_bits1 >> LONG_BIT/2) != 0
|| (mid_bits2 >> LONG_BIT/2) != 0;
}
// Option 3, perform long multiplication using a smaller type (this code is
// very similar to option 2, but calculates overflow using a different but
// equivalent method).
// - Sometimes the fastest (e.g., when mulitply on longs is a library
// call; clang likes this code).
// - Performs at most three multiplies, and sometimes only performs one.
// - Highly portable code; works no matter how many bits unsigned long is
BOOL
umull_overflow3(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long HALFSIZE_MAX = (1ul << LONG_BIT/2) - 1ul;
unsigned long lhs_high = lhs >> LONG_BIT/2;
unsigned long lhs_low = lhs & HALFSIZE_MAX;
unsigned long rhs_high = rhs >> LONG_BIT/2;
unsigned long rhs_low = rhs & HALFSIZE_MAX;
unsigned long lowbits = lhs_low * rhs_low;
if (!(lhs_high || rhs_high)) {
*result = lowbits;
return 0;
}
BOOL overflowed = lhs_high && rhs_high;
unsigned long midbits1 = lhs_low * rhs_high;
unsigned long midbits2 = lhs_high * rhs_low;
unsigned long midbits = midbits1 + midbits2;
overflowed = overflowed || midbits < midbits1 || midbits > HALFSIZE_MAX;
unsigned long product = lowbits + (midbits << LONG_BIT/2);
overflowed = overflowed || product < lowbits;
*result = product;
return overflowed;
}
// Option 4, checks for overflow using division
// - Checks for overflow using division
// - Division is slow, especially if it is a library call
BOOL
umull_overflow4(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
*result = lhs * rhs;
return rhs > 0 && (SIZE_MAX / rhs) < lhs;
}
// Option 5, checks for overflow using division
// - Checks for overflow using division
// - Avoids division when the numbers are "small enough" to trivially
// rule out overflow
// - Division is slow, especially if it is a library call
BOOL
umull_overflow5(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long MUL_NO_OVERFLOW = (1ul << LONG_BIT/2) - 1ul;
*result = lhs * rhs;
return (lhs >= MUL_NO_OVERFLOW || rhs >= MUL_NO_OVERFLOW) &&
rhs > 0 && SIZE_MAX / rhs < lhs;
}
#ifndef umull_overflow
#define umull_overflow2
#endif
/*
* This benchmark code performs a multiply at all bit sizes,
* essentially assuming that sizes are logarithmically distributed.
*/
int main()
{
unsigned long i, j, k;
int count = 0;
unsigned long mult;
unsigned long total = 0;
for (k = 0; k < 0x40000000 / LONG_BIT / LONG_BIT; ++k)
for (i = 0; i != LONG_MAX; i = i*2+1)
for (j = 0; j != LONG_MAX; j = j*2+1) {
count += umull_overflow(i+k, j+k, &mult);
total += mult;
}
printf("%d overflows (total %lu)\n", count, total);
}
Here are the results, testing with various compilers and systems I have (in this case, all testing was done on OS X, but results should be similar on BSD or Linux systems):
+------------------+----------+----------+----------+----------+----------+
| | Option 1 | Option 2 | Option 3 | Option 4 | Option 5 |
| | BigInt | LngMult1 | LngMult2 | Div | OptDiv |
+------------------+----------+----------+----------+----------+----------+
| Clang 3.5 i386 | 1.610 | 3.217 | 3.129 | 4.405 | 4.398 |
| GCC 4.9.0 i386 | 1.488 | 3.469 | 5.853 | 4.704 | 4.712 |
| GCC 4.2.1 i386 | 2.842 | 4.022 | 3.629 | 4.160 | 4.696 |
| GCC 4.2.1 PPC32 | 8.227 | 7.756 | 7.242 | 20.632 | 20.481 |
| GCC 3.3 PPC32 | 5.684 | 9.804 | 11.525 | 21.734 | 22.517 |
+------------------+----------+----------+----------+----------+----------+
| Clang 3.5 x86_64 | 1.584 | 2.472 | 2.449 | 9.246 | 7.280 |
| GCC 4.9 x86_64 | 1.414 | 2.623 | 4.327 | 9.047 | 7.538 |
| GCC 4.2.1 x86_64 | 2.143 | 2.618 | 2.750 | 9.510 | 7.389 |
| GCC 4.2.1 PPC64 | 13.178 | 8.994 | 8.567 | 37.504 | 29.851 |
+------------------+----------+----------+----------+----------+----------+
Based on these results, we can draw a few conclusions:
Clearly, the division-based approach, although simple and portable, is slow.
No technique is a clear winner in all cases.
On modern compilers, the use-a-larger-int approach is best, if you can use it
On older compilers, the long-multiplication approach is best
Surprisingly, GCC 4.9.0 has performance regressions over GCC 4.2.1, and GCC 4.2.1 has performance regressions over GCC 3.3

A version that also works when a == 0:
x = a * b;
if (a != 0 && x / a != b) {
// overflow handling
}

Easy and fast with clang and gcc:
unsigned long long t a, b, result;
if (__builtin_umulll_overflow(a, b, &result)) {
// overflow!!
}
This will use hardware support for overflow detection where available. By being compiler extensions it can even handle signed integer overflow (replace umul with smul), eventhough that is undefined behavior in C++.

If you need not just to detect overflow but also to capture the carry, you're best off breaking your numbers down into 32-bit parts. The code is a nightmare; what follows is just a sketch:
#include <stdint.h>
uint64_t mul(uint64_t a, uint64_t b) {
uint32_t ah = a >> 32;
uint32_t al = a; // truncates: now a = al + 2**32 * ah
uint32_t bh = b >> 32;
uint32_t bl = b; // truncates: now b = bl + 2**32 * bh
// a * b = 2**64 * ah * bh + 2**32 * (ah * bl + bh * al) + al * bl
uint64_t partial = (uint64_t) al * (uint64_t) bl;
uint64_t mid1 = (uint64_t) ah * (uint64_t) bl;
uint64_t mid2 = (uint64_t) al * (uint64_t) bh;
uint64_t carry = (uint64_t) ah * (uint64_t) bh;
// add high parts of mid1 and mid2 to carry
// add low parts of mid1 and mid2 to partial, carrying
// any carry bits into carry...
}
The problem is not just the partial products but the fact that any of the sums can overflow.
If I had to do this for real, I would write an extended-multiply routine in the local assembly language. That is, for example, multiply two 64-bit integers to get a 128-bit result, which is stored in two 64-bit registers. All reasonable hardware provides this functionality in a single native multiply instruction—it's not just accessible from C.
This is one of those rare cases where the solution that's most elegant and easy to program is actually to use assembly language. But it's certainly not portable :-(

The GNU Portability Library (Gnulib) contains a module intprops, which has macros that efficiently test whether arithmetic operations would overflow.
For example, if an overflow in multiplication would occur, INT_MULTIPLY_OVERFLOW (a, b) would yield 1.

Perhaps the best way to solve this problem is to have a function, which multiplies two UInt64 and results a pair of UInt64, an upper part and a lower part of the UInt128 result. Here is the solution, including a function, which displays the result in hex. I guess you perhaps prefer a C++ solution, but I have a working Swift-Solution which shows, how to manage the problem:
func hex128 (_ hi: UInt64, _ lo: UInt64) -> String
{
var s: String = String(format: "%08X", hi >> 32)
+ String(format: "%08X", hi & 0xFFFFFFFF)
+ String(format: "%08X", lo >> 32)
+ String(format: "%08X", lo & 0xFFFFFFFF)
return (s)
}
func mul64to128 (_ multiplier: UInt64, _ multiplicand : UInt64)
-> (result_hi: UInt64, result_lo: UInt64)
{
let x: UInt64 = multiplier
let x_lo: UInt64 = (x & 0xffffffff)
let x_hi: UInt64 = x >> 32
let y: UInt64 = multiplicand
let y_lo: UInt64 = (y & 0xffffffff)
let y_hi: UInt64 = y >> 32
let mul_lo: UInt64 = (x_lo * y_lo)
let mul_hi: UInt64 = (x_hi * y_lo) + (mul_lo >> 32)
let mul_carry: UInt64 = (x_lo * y_hi) + (mul_hi & 0xffffffff)
let result_hi: UInt64 = (x_hi * y_hi) + (mul_hi >> 32) + (mul_carry >> 32)
let result_lo: UInt64 = (mul_carry << 32) + (mul_lo & 0xffffffff)
return (result_hi, result_lo)
}
Here is an example to verify, that the function works:
var c: UInt64 = 0
var d: UInt64 = 0
(c, d) = mul64to128(0x1234567890123456, 0x9876543210987654)
// 0AD77D742CE3C72E45FD10D81D28D038 is the result of the above example
print(hex128(c, d))
(c, d) = mul64to128(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF)
// FFFFFFFFFFFFFFFE0000000000000001 is the result of the above example
print(hex128(c, d))

There is a simple (and often very fast solution) which has not been mentioned yet. The solution is based on the fact that n-Bit times m-Bit multiplication does never overflow for a product width of n+m-bit or higher but overflows for all result widths smaller than n+m-1.
Because my old description might have been too difficult to read for some people, I try it again:
What you need is checking the sum of leading-zeroes of both operands. It would be very easy to prove mathematically.
Let x be n-Bit and y be m-Bit. z = x * y is k-Bit. Because the product can be n+m bit large at most it can overflow. Let's say. x*y is p-Bit long (without leading zeroes). The leading zeroes of the product are clz(x * y) = n+m - p. clz behaves similar to log, hence:
clz(x * y) = clz(x) + clz(y) + c with c = either 1 or 0.
(thank you for the c = 1 advice in the comment!)
It overflows when k < p <= n+m <=> n+m - k > n+m - p = clz(x * y).
Now we can use this algorithm:
if max(clz(x * y)) = clz(x) + clz(y) +1 < (n+m - k) --> overflow
if max(clz(x * y)) = clz(x) + clz(y) +1 == (n+m - k) --> overflow if c = 0
else --> no overflow
How to check for overflow in the middle case? I assume, you have a multiplication instruction. Then we easily can use it to see the leading zeroes of the result, i.e.:
if clz(x * y / 2) == (n+m - k) <=> msb(x * y/2) == 1 --> overflow
else --> no overflow
You do the multiplication by treating x/2 as fixed point and y as normal integer:
msb(x * y/2) = msb(floor(x * y / 2))
floor(x * y/2) = floor(x/2) * y + (lsb(x) * floor(y/2)) = (x >> 1)*y + (x & 1)*(y >> 1)
(this result never overflows in case of clz(x)+clz(y)+1 == (n+m -k))
The trick is using builtins/intrinsics. In GCC it looks this way:
static inline int clz(int a) {
if (a == 0) return 32; //only needed for x86 architecture
return __builtin_clz(a);
}
/**#fn static inline _Bool chk_mul_ov(uint32_t f1, uint32_t f2)
* #return one, if a 32-Bit-overflow occurs when unsigned-unsigned-multipliying f1 with f2 otherwise zero. */
static inline _Bool chk_mul_ov(uint32_t f1, uint32_t f2) {
int lzsum = clz(f1) + clz(f2); //leading zero sum
return
lzsum < sizeof(f1)*8-1 || ( //if too small, overflow guaranteed
lzsum == sizeof(f1)*8-1 && //if special case, do further check
(int32_t)((f1 >> 1)*f2 + (f1 & 1)*(f2 >> 1)) < 0 //check product rightshifted by one
);
}
...
if (chk_mul_ov(f1, f2)) {
//error handling
}
...
Just an example for n = m = k = 32-Bit unsigned-unsigned-multiplication. You can generalize it to signed-unsigned- or signed-signed-multiplication. And even no multiple-bit-shift is required (because some microcontrollers implement one-bit-shifts only but sometimes support product divided by two with a single instruction like Atmega!). However, if no count-leading-zeroes instruction exists but long multiplication, this might not be better.
Other compilers have their own way of specifying intrinsics for CLZ operations.
Compared to checking upper half of the multiplication the clz-method should scale better (in worst case) than using a highly optimized 128-Bit multiplication to check for 64-Bit overflow. Multiplication needs over linear overhead while count bits needs only linear overhead.
This code worked out-of-the box for me when tried.

I've been working with this problem this days and I have to say that it has impressed me the number of times I have seen people saying the best way to know if there has been an overflow is to divide the result, thats totally inefficient and unnecessary. The point for this function is that it must be as fast as possible.
There are two options for the overflow detection:
1º- If possible create the result variable twice as big as the multipliers, for example:
struct INT32struct {INT16 high, low;};
typedef union
{
struct INT32struct s;
INT32 ll;
} INT32union;
INT16 mulFunction(INT16 a, INT16 b)
{
INT32union result.ll = a * b; //32Bits result
if(result.s.high > 0)
Overflow();
return (result.s.low)
}
You will know inmediately if there has been an overflow, and the code is the fastest possible without writing it in machine code. Depending on the compiler this code can be improved in machine code.
2º- Is impossible to create a result variable twice as big as the multipliers variable:
Then you should play with if conditions to determine the best path. Continuing with the example:
INT32 mulFunction(INT32 a, INT32 b)
{
INT32union s_a.ll = abs(a);
INT32union s_b.ll = abs(b); //32Bits result
INT32union result;
if(s_a.s.hi > 0 && s_b.s.hi > 0)
{
Overflow();
}
else if (s_a.s.hi > 0)
{
INT32union res1.ll = s_a.s.hi * s_b.s.lo;
INT32union res2.ll = s_a.s.lo * s_b.s.lo;
if (res1.hi == 0)
{
result.s.lo = res1.s.lo + res2.s.hi;
if (result.s.hi == 0)
{
result.s.ll = result.s.lo << 16 + res2.s.lo;
if ((a.s.hi >> 15) ^ (b.s.hi >> 15) == 1)
{
result.s.ll = -result.s.ll;
}
return result.s.ll
}else
{
Overflow();
}
}else
{
Overflow();
}
}else if (s_b.s.hi > 0)
{
//Same code changing a with b
}else
{
return (s_a.lo * s_b.lo);
}
}
I hope this code helps you to have a quite efficient program and I hope the code is clear, if not I'll put some coments.
best regards.

Here is a trick for detecting whether multiplication of two unsigned integers overflows.
We make the observation that if we multiply an N-bit-wide binary number with an M-bit-wide binary number, the product does not have more than N + M bits.
For instance, if we are asked to multiply a three-bit number with a twenty-nine bit number, we know that this doesn't overflow thirty-two bits.
#include <stdlib.h>
#include <stdio.h>
int might_be_mul_oflow(unsigned long a, unsigned long b)
{
if (!a || !b)
return 0;
a = a | (a >> 1) | (a >> 2) | (a >> 4) | (a >> 8) | (a >> 16) | (a >> 32);
b = b | (b >> 1) | (b >> 2) | (b >> 4) | (b >> 8) | (b >> 16) | (b >> 32);
for (;;) {
unsigned long na = a << 1;
if (na <= a)
break;
a = na;
}
return (a & b) ? 1 : 0;
}
int main(int argc, char **argv)
{
unsigned long a, b;
char *endptr;
if (argc < 3) {
printf("supply two unsigned long integers in C form\n");
return EXIT_FAILURE;
}
a = strtoul(argv[1], &endptr, 0);
if (*endptr != 0) {
printf("%s is garbage\n", argv[1]);
return EXIT_FAILURE;
}
b = strtoul(argv[2], &endptr, 0);
if (*endptr != 0) {
printf("%s is garbage\n", argv[2]);
return EXIT_FAILURE;
}
if (might_be_mul_oflow(a, b))
printf("might be multiplication overflow\n");
{
unsigned long c = a * b;
printf("%lu * %lu = %lu\n", a, b, c);
if (a != 0 && c / a != b)
printf("confirmed multiplication overflow\n");
}
return 0;
}
A smattering of tests: (on 64 bit system):
$ ./uflow 0x3 0x3FFFFFFFFFFFFFFF
3 * 4611686018427387903 = 13835058055282163709
$ ./uflow 0x7 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
7 * 4611686018427387903 = 13835058055282163705
confirmed multiplication overflow
$ ./uflow 0x4 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
4 * 4611686018427387903 = 18446744073709551612
$ ./uflow 0x5 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
5 * 4611686018427387903 = 4611686018427387899
confirmed multiplication overflow
The steps in might_be_mul_oflow are almost certainly slower than just doing the division test, at least on mainstream processors used in desktop workstations, servers and mobile devices. On chips without good division support, it could be useful.
It occurs to me that there is another way to do this early rejection test.
We start with a pair of numbers arng and brng which are initialized to 0x7FFF...FFFF and 1.
If a <= arng and b <= brng we can conclude that there is no overflow.
Otherwise, we shift arng to the right, and shift brng to the left, adding one bit to brng, so that they are 0x3FFF...FFFF and 3.
If arng is zero, finish; otherwise repeat at 2.
The function now looks like:
int might_be_mul_oflow(unsigned long a, unsigned long b)
{
if (!a || !b)
return 0;
{
unsigned long arng = ULONG_MAX >> 1;
unsigned long brng = 1;
while (arng != 0) {
if (a <= arng && b <= brng)
return 0;
arng >>= 1;
brng <<= 1;
brng |= 1;
}
return 1;
}
}

When your using e.g. 64 bits variables, implement 'number of significant bits' with nsb(var) = { 64 - clz(var); }.
clz(var) = count leading zeros in var, a builtin command for GCC and Clang, or probably available with inline assembly for your CPU.
Now use the fact that nsb(a * b) <= nsb(a) + nsb(b) to check for overflow. When smaller, it is always 1 smaller.
Ref GCC: Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

I was thinking about this today and stumbled upon this question, my thoughts led me to this result. TLDR, while I find it "elegant" in that it only uses a few lines of code (could easily be a one liner), and has some mild math that simplifies to something relatively simple conceptually, this is mostly "interesting" and I haven't tested it.
If you think of an unsigned integer as being a single digit with radix 2^n where n is the number of bits in the integer, then you can map those numbers to radians around the unit circle, e.g.
radians(x) = x * (2 * pi * rad / 2^n)
When the integer overflows, it is equivalent to wrapping around the circle. So calculating the carry is equivalent to calculating the number of times multiplication would wrap around the circle. To calculate the number of times we wrap around the circle we divide radians(x) by 2pi radians. e.g.
wrap(x) = radians(x) / (2*pi*rad)
= (x * (2*pi*rad / 2^n)) / (2*pi*rad / 1)
= (x * (2*pi*rad / 2^n)) * (1 / 2*pi*rad)
= x * 1 / 2^n
= x / 2^n
Which simplifies to
wrap(x) = x / 2^n
This makes sense. The number of times a number, for example, 15 with radix 10, wraps around is 15 / 10 = 1.5, or one and a half times. However, we can't use 2 digits here (assuming we are limited to a single 2^64 digit).
Say we have a * b, with radix R, we can calculate the carry with
Consider that: wrap(a * b) = a * wrap(b)
wrap(a * b) = (a * b) / R
a * wrap(b) = a * (b / R)
a * (b / R) = (a * b) / R
carry = floor(a * wrap(b))
Take for example a = 9 and b = 5, which are factors of 45 (i.e. 9 * 5 = 45).
wrap(5) = 5 / 10 = 0.5
a * wrap(5) = 9 * 0.5 = 4.5
carry = floor(9 * wrap(5)) = floor(4.5) = 4
Note that if the carry was 0, then we would not have had overflow, for example if a = 2, b=2.
In C/C++ (if the compiler and architecture supports it) we have to use long double.
Thus we have:
long double wrap = b / 18446744073709551616.0L; // this is b / 2^64
unsigned long carry = (unsigned long)(a * wrap); // floor(a * wrap(b))
bool overflow = carry > 0;
unsigned long c = a * b;
c here is the lower significant "digit", i.e. in base 10 9 * 9 = 81, carry = 8, and c = 1.
This was interesting to me in theory, so I thought I'd share it, but one major caveat is with the floating point precision in computers. Using long double, there may be rounding errors for some numbers when we calculate the wrap variable depending on how many significant digits your compiler/arch uses for long doubles, I believe it should be 20 more more to be sure. Another issue with this result, is that it may not perform as well as some of the other solutions simply by using floating points and division.

If you just want to detect overflow, how about converting to double, doing the multiplication and if
|x| < 2^53, convert to int64
|x| < 2^63, make the multiplication using int64
otherwise produce whatever error you want?
This seems to work:
int64_t safemult(int64_t a, int64_t b) {
double dx;
dx = (double)a * (double)b;
if ( fabs(dx) < (double)9007199254740992 )
return (int64_t)dx;
if ( (double)INT64_MAX < fabs(dx) )
return INT64_MAX;
return a*b;
}

Fast multiplication modulo 2^16 + 1

The IDEA cipher uses multiplication modulo 2^16 + 1. Is there an algorithm to perform this operation without general modulo operator (only modulo 2^16 (truncation))? In the context of IDEA, zero is interpreted as 2^16 (it means zero isn't an argument of our multiplication and it cannot be the result, so we can save one bit and store value 2^16 as bit pattern 0000000000000000). I am wondering how to implement it efficiently (or whether it is possible at all) without using the standard modulo operator.

You can utilize the fact, that (N-1) % N == -1.
Thus, (65536 * a) % 65537 == -a % 65537.
Also, -a % 65537 == -a + 1 (mod 65536), when 0 is interpreted as 65536
uint16_t fastmod65537(uint16_t a, uint16_t b)
{
uint32_t c;
uint16_t hi, lo;
if (a == 0)
return -b + 1;
if (b == 0)
return -a + 1;
c = (uint32_t)a * (uint32_t)b;
hi = c >> 16;
lo = c;
if (lo > hi)
return lo-hi;
return lo-hi+1;
}
The only problem here is if hi == lo, the result would be 0. Luckily a test suite confirms, that it actually can't be...
int main()
{
uint64_t a, b;
for (a = 1; a <= 65536; a++)
for (b = 1; b <= 65536; b++)
{
uint64_t c = a*b;
uint32_t d = (c % 65537) & 65535;
uint32_t e = m(a & 65535, b & 65535);
if (d != e)
printf("a * b % 65537 != m(%d, %d) real=%d m()=%d\n",
(uint32_t)a, (uint32_t)b, d, e);
}
}
Output: none

First, the case where either a or b is zero. In that case, it is interpreted as having the value 2^16, therefore elementary modulo arithmetic tells us that:
result = -a - b + 1;
, because (in the context of IDEA) the multiplicative inverse of 2^16 is still 2^16, and its lowest 16 bits are all zeroes.
The general case is much easier than it seems, now that we took care of the "0" special case (2^16+1 is 0x10001):
/* This operation can overflow: */
unsigned result = (product & 0xFFFF) - (product >> 16);
/* ..so account for cases of overflow: */
result -= result >> 16;
Putting it together:
/* All types must be sufficiently wide unsigned, e.g. uint32_t: */
unsigned long long product = a * b;
if (product == 0) {
return -a - b + 1;
} else {
result = (product & 0xFFFF) - (product >> 16);
result -= result >> 16;
return result & 0xFFFF;
}

How can I implement a modulo operation on unsigned ints with limited hardware in C

I have a machine which only supports 32 bit operations, long long does not work on this machine.
I have one 64 bit quantity represented as two unsigned int 32s.
The question is how can I perform a mod on that 64 bit quantity with a 32 bit divisor.
r = a mod b
where:
a is 64 bit value and b is 32 bit value
I was thinking that I could represent the mod part by doing:
a = a1 * (2 ^ 32) + a2 (where a1 is the top bits and a2 is the bottom bits)
(a1 * (2 ^32) + a2) mod b = ( (a1 * 2 ^ 32) mod b + a2 mod b) mod b
( (a1 * 2 ^ 32) mod b + a2 mod b) mod b = (a1 mod b * 2 ^ 32 mod b + a2 mod b) mod b
but the problem is that 2 ^ 32 mod b may sometimes be equal to 2 ^ 32 and therefore the multiplication will overflow. I have looked at attempting to convert the multiplication into an addition but that also requires me to use 2 ^ 32 which if I mod will again give me 2 ^ 32 :) so I am not sure how to perform an unsigned mod of a 64 bit value with a 32 bit one.
I guess a simple solution to this would be to perform the following operations:
a / b = c
a = a - floor(c) * b
perform 1 until c is equal to 0 and use a as the answer.
but I am not sure how to combine these two integers together to form the 64 bit value
Just to be complete here are some links for binary division and subtractions:
http://www.exploringbinary.com/binary-division/
and a description of binary division algorithm:
http://en.wikipedia.org/wiki/Division_algorithm

Works: Tested with 1000M random combinations against a 64-bit %.
Like grade school long division a/b (but in base 2), subtract b from a if possible, then shift, looping 64 times. Return the remainder.
#define MSBit 0x80000000L
uint32_t mod32(uint32_t a1 /* MSHalf */, uint32_t a2 /* LSHalf */, uint32_t b) {
uint32_t a = 0;
for (int i = 31+32; i >= 0; i--) {
if (a & MSBit) { // Note 1
a <<= 1;
a -= b;
} else {
a <<= 1;
}
if (a1 & MSBit) a++;
a1 <<= 1;
if (a2 & MSBit) a1++;
a2 <<= 1;
if (a >= b)
a -= b;
}
return a;
}
Note 1: This is the sneaky part to do a 33-bit subtraction. Since code knows n has the MSBit set, 2*n will be greater than b, then n = 2*n - b. This counts on unsigned wrap-around.
[Edit]
Below is a generic modu() that works with any array size a and any size unsigned integer.
#include <stdint.h>
#include <limits.h>
// Use any unpadded unsigned integer type
#define UINT uint32_t
#define BitSize (sizeof(UINT) * CHAR_BIT)
#define MSBit ((UINT)1 << (BitSize - 1))
UINT modu(const UINT *aarray, size_t alen, UINT b) {
UINT r = 0;
while (alen-- > 0) {
UINT a = aarray[alen];
for (int i = BitSize; i > 0; i--) {
UINT previous = r;
r <<= 1;
if (a & MSBit) {
r++;
}
a <<= 1;
if ((previous & MSBit) || (r >= b)) {
r -= b;
}
}
}
return r;
}
UINT modu2(UINT a1 /* MSHalf */, UINT a2 /* LSHalf */, UINT b) {
UINT a[] = { a2, a1 }; // Least significant at index 0
return modu(a, sizeof a / sizeof a[0], b);
}

Do it the same way you would do a long division with pencil and paper.
#include <stdio.h>
unsigned int numh = 0x12345678;
unsigned int numl = 0x456789AB;
unsigned int denom = 0x17234591;
int main() {
unsigned int numer, quotient, remain;
numer = numh >> 16;
quotient = numer / denom;
remain = numer - quotient * denom;
numer = (remain << 16) | (numh & 0xffff);
quotient = numer / denom;
remain = numer - quotient * denom;
numer = (remain << 16) | (numl >> 16);
quotient = numer / denom;
remain = numer - quotient * denom;
numer = (remain << 16) | (numl & 0xffff);
quotient = numer / denom;
remain = numer - quotient * denom;
printf("%X\n", remain);
return 0;
}

Modulo algorithm with byte array and 8 bit integer: 8bit = bytes % 8bit

I need to create an algorithm implemented in C that do modulo arithmetic between an arbitrary number of bytes and one byte. See this:
typedef struct{
u_int8_t * data;
u_int16_t length;
}UBigInt;
u_int8_t UBigIntModuloWithUInt8(UBigInt a,u_int8_t b){
}
For powers of two a & (b-1) can be used but what about non powers of two?
I realise one method is: a - b*(a/b)
That would require to use UBigIntDivisionWithUInt8 and UBigIntMultiplicationWithUInt8 and UBigIntSubtractionWithUBigInt. There might be a more efficient way to do this?
Thank you.
This is the implementation I now have:
u_int8_t UBigIntModuloWithUInt8(UBigInt a,u_int8_t b){
if (!(b & (b - 1)))
return a.data[a.length - 1] & b - 1; // For powers of two this can be done
// Wasn't a power of two.
u_int16_t result = 0; // Prevents overflow in calculations
for(int x = 0; x < a.length; x++) {
result *= (256 % b);
result %= b;
result += a.data[x] % b;
result %= b;
}
return result;
}

You can use a variation on Horner's method.
Process a byte by byte with this formula:
a % b = ((a // 256) % b) * (256 % b) + (a % 256) % b, where x // y is the rounding division (normal C integer division). The reason this will work is that congruence modulo b is an equivalence relation.
With this you have an O(length) algorithm, or O(log(a)).
Example snippet (untested, my C skills are rusty):
u_int16_t result = 0; // Just in case, to prevent overflow
for(i = 0, i<a.length; i++) {
result *= (256 % b);
result %= b;
result += (a[i] % b);
result %= b;
}
Some justification:
a = (a // 256) * 256 + (a % 256), therefore
a % b = ((a // 256) * 256) % b + ((a % 256) % b). However a % 256 = a[n-1] and a // 256 = a[0 .. n-2]. Reversing the actions in a way similar to Horner's rule gives you the presented snippet.

Catch and compute overflow during multiplication of two large integers

I am looking for an efficient (optionally standard, elegant and easy to implement) solution to multiply relatively large numbers, and store the result into one or several integers :
Let say I have two 64 bits integers declared like this :
uint64_t a = xxx, b = yyy;
When I do a * b, how can I detect if the operation results in an overflow and in this case store the carry somewhere?
Please note that I don't want to use any large-number library since I have constraints on the way I store the numbers.

1. Detecting the overflow:
x = a * b;
if (a != 0 && x / a != b) {
// overflow handling
}
Edit: Fixed division by 0 (thanks Mark!)
2. Computing the carry is quite involved. One approach is to split both operands into half-words, then apply long multiplication to the half-words:
uint64_t hi(uint64_t x) {
return x >> 32;
}
uint64_t lo(uint64_t x) {
return ((1ULL << 32) - 1) & x;
}
void multiply(uint64_t a, uint64_t b) {
// actually uint32_t would do, but the casting is annoying
uint64_t s0, s1, s2, s3;
uint64_t x = lo(a) * lo(b);
s0 = lo(x);
x = hi(a) * lo(b) + hi(x);
s1 = lo(x);
s2 = hi(x);
x = s1 + lo(a) * hi(b);
s1 = lo(x);
x = s2 + hi(a) * hi(b) + hi(x);
s2 = lo(x);
s3 = hi(x);
uint64_t result = s1 << 32 | s0;
uint64_t carry = s3 << 32 | s2;
}
To see that none of the partial sums themselves can overflow, we consider the worst case:
x = s2 + hi(a) * hi(b) + hi(x)
Let B = 1 << 32. We then have
x <= (B - 1) + (B - 1)(B - 1) + (B - 1)
<= B*B - 1
< B*B
I believe this will work - at least it handles Sjlver's test case. Aside from that, it is untested (and might not even compile, as I don't have a C++ compiler at hand anymore).

The idea is to use following fact which is true for integral operation:
a*b > c if and only if a > c/b
/ is integral division here.
The pseudocode to check against overflow for positive numbers follows:
if (a > max_int64 / b) then "overflow" else "ok".
To handle zeroes and negative numbers you should add more checks.
C code for non-negative a and b follows:
if (b > 0 && a > 18446744073709551615 / b) {
// overflow handling
}; else {
c = a * b;
}
Note, max value for 64 type:
18446744073709551615 == (1<<64)-1
To calculate carry we can use approach to split number into two 32-digits and multiply them as we do this on the paper. We need to split numbers to avoid overflow.
Code follows:
// split input numbers into 32-bit digits
uint64_t a0 = a & ((1LL<<32)-1);
uint64_t a1 = a >> 32;
uint64_t b0 = b & ((1LL<<32)-1);
uint64_t b1 = b >> 32;
// The following 3 lines of code is to calculate the carry of d1
// (d1 - 32-bit second digit of result, and it can be calculated as d1=d11+d12),
// but to avoid overflow.
// Actually rewriting the following 2 lines:
// uint64_t d1 = (a0 * b0 >> 32) + a1 * b0 + a0 * b1;
// uint64_t c1 = d1 >> 32;
uint64_t d11 = a1 * b0 + (a0 * b0 >> 32);
uint64_t d12 = a0 * b1;
uint64_t c1 = (d11 > 18446744073709551615 - d12) ? 1 : 0;
uint64_t d2 = a1 * b1 + c1;
uint64_t carry = d2; // needed carry stored here

Although there have been several other answers to this question, I several of them have code that is completely untested, and thus far no one has adequately compared the different possible options.
For that reason, I wrote and tested several possible implementations (the last one is based on this code from OpenBSD, discussed on Reddit here). Here's the code:
/* Multiply with overflow checking, emulating clang's builtin function
*
* __builtin_umull_overflow
*
* This code benchmarks five possible schemes for doing so.
*/
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>
#ifndef BOOL
#define BOOL int
#endif
// Option 1, check for overflow a wider type
// - Often fastest and the least code, especially on modern compilers
// - When long is a 64-bit int, requires compiler support for 128-bits
// ints (requires GCC >= 3.0 or Clang)
#if LONG_BIT > 32
typedef __uint128_t long_overflow_t ;
#else
typedef uint64_t long_overflow_t;
#endif
BOOL
umull_overflow1(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
long_overflow_t prod = (long_overflow_t)lhs * (long_overflow_t)rhs;
*result = (unsigned long) prod;
return (prod >> LONG_BIT) != 0;
}
// Option 2, perform long multiplication using a smaller type
// - Sometimes the fastest (e.g., when mulitply on longs is a library
// call).
// - Performs at most three multiplies, and sometimes only performs one.
// - Highly portable code; works no matter how many bits unsigned long is
BOOL
umull_overflow2(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long HALFSIZE_MAX = (1ul << LONG_BIT/2) - 1ul;
unsigned long lhs_high = lhs >> LONG_BIT/2;
unsigned long lhs_low = lhs & HALFSIZE_MAX;
unsigned long rhs_high = rhs >> LONG_BIT/2;
unsigned long rhs_low = rhs & HALFSIZE_MAX;
unsigned long bot_bits = lhs_low * rhs_low;
if (!(lhs_high || rhs_high)) {
*result = bot_bits;
return 0;
}
BOOL overflowed = lhs_high && rhs_high;
unsigned long mid_bits1 = lhs_low * rhs_high;
unsigned long mid_bits2 = lhs_high * rhs_low;
*result = bot_bits + ((mid_bits1+mid_bits2) << LONG_BIT/2);
return overflowed || *result < bot_bits
|| (mid_bits1 >> LONG_BIT/2) != 0
|| (mid_bits2 >> LONG_BIT/2) != 0;
}
// Option 3, perform long multiplication using a smaller type (this code is
// very similar to option 2, but calculates overflow using a different but
// equivalent method).
// - Sometimes the fastest (e.g., when mulitply on longs is a library
// call; clang likes this code).
// - Performs at most three multiplies, and sometimes only performs one.
// - Highly portable code; works no matter how many bits unsigned long is
BOOL
umull_overflow3(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long HALFSIZE_MAX = (1ul << LONG_BIT/2) - 1ul;
unsigned long lhs_high = lhs >> LONG_BIT/2;
unsigned long lhs_low = lhs & HALFSIZE_MAX;
unsigned long rhs_high = rhs >> LONG_BIT/2;
unsigned long rhs_low = rhs & HALFSIZE_MAX;
unsigned long lowbits = lhs_low * rhs_low;
if (!(lhs_high || rhs_high)) {
*result = lowbits;
return 0;
}
BOOL overflowed = lhs_high && rhs_high;
unsigned long midbits1 = lhs_low * rhs_high;
unsigned long midbits2 = lhs_high * rhs_low;
unsigned long midbits = midbits1 + midbits2;
overflowed = overflowed || midbits < midbits1 || midbits > HALFSIZE_MAX;
unsigned long product = lowbits + (midbits << LONG_BIT/2);
overflowed = overflowed || product < lowbits;
*result = product;
return overflowed;
}
// Option 4, checks for overflow using division
// - Checks for overflow using division
// - Division is slow, especially if it is a library call
BOOL
umull_overflow4(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
*result = lhs * rhs;
return rhs > 0 && (SIZE_MAX / rhs) < lhs;
}
// Option 5, checks for overflow using division
// - Checks for overflow using division
// - Avoids division when the numbers are "small enough" to trivially
// rule out overflow
// - Division is slow, especially if it is a library call
BOOL
umull_overflow5(unsigned long lhs, unsigned long rhs, unsigned long* result)
{
const unsigned long MUL_NO_OVERFLOW = (1ul << LONG_BIT/2) - 1ul;
*result = lhs * rhs;
return (lhs >= MUL_NO_OVERFLOW || rhs >= MUL_NO_OVERFLOW) &&
rhs > 0 && SIZE_MAX / rhs < lhs;
}
#ifndef umull_overflow
#define umull_overflow2
#endif
/*
* This benchmark code performs a multiply at all bit sizes,
* essentially assuming that sizes are logarithmically distributed.
*/
int main()
{
unsigned long i, j, k;
int count = 0;
unsigned long mult;
unsigned long total = 0;
for (k = 0; k < 0x40000000 / LONG_BIT / LONG_BIT; ++k)
for (i = 0; i != LONG_MAX; i = i*2+1)
for (j = 0; j != LONG_MAX; j = j*2+1) {
count += umull_overflow(i+k, j+k, &mult);
total += mult;
}
printf("%d overflows (total %lu)\n", count, total);
}
Here are the results, testing with various compilers and systems I have (in this case, all testing was done on OS X, but results should be similar on BSD or Linux systems):
+------------------+----------+----------+----------+----------+----------+
| | Option 1 | Option 2 | Option 3 | Option 4 | Option 5 |
| | BigInt | LngMult1 | LngMult2 | Div | OptDiv |
+------------------+----------+----------+----------+----------+----------+
| Clang 3.5 i386 | 1.610 | 3.217 | 3.129 | 4.405 | 4.398 |
| GCC 4.9.0 i386 | 1.488 | 3.469 | 5.853 | 4.704 | 4.712 |
| GCC 4.2.1 i386 | 2.842 | 4.022 | 3.629 | 4.160 | 4.696 |
| GCC 4.2.1 PPC32 | 8.227 | 7.756 | 7.242 | 20.632 | 20.481 |
| GCC 3.3 PPC32 | 5.684 | 9.804 | 11.525 | 21.734 | 22.517 |
+------------------+----------+----------+----------+----------+----------+
| Clang 3.5 x86_64 | 1.584 | 2.472 | 2.449 | 9.246 | 7.280 |
| GCC 4.9 x86_64 | 1.414 | 2.623 | 4.327 | 9.047 | 7.538 |
| GCC 4.2.1 x86_64 | 2.143 | 2.618 | 2.750 | 9.510 | 7.389 |
| GCC 4.2.1 PPC64 | 13.178 | 8.994 | 8.567 | 37.504 | 29.851 |
+------------------+----------+----------+----------+----------+----------+
Based on these results, we can draw a few conclusions:
Clearly, the division-based approach, although simple and portable, is slow.
No technique is a clear winner in all cases.
On modern compilers, the use-a-larger-int approach is best, if you can use it
On older compilers, the long-multiplication approach is best
Surprisingly, GCC 4.9.0 has performance regressions over GCC 4.2.1, and GCC 4.2.1 has performance regressions over GCC 3.3

A version that also works when a == 0:
x = a * b;
if (a != 0 && x / a != b) {
// overflow handling
}

Easy and fast with clang and gcc:
unsigned long long t a, b, result;
if (__builtin_umulll_overflow(a, b, &result)) {
// overflow!!
}
This will use hardware support for overflow detection where available. By being compiler extensions it can even handle signed integer overflow (replace umul with smul), eventhough that is undefined behavior in C++.

If you need not just to detect overflow but also to capture the carry, you're best off breaking your numbers down into 32-bit parts. The code is a nightmare; what follows is just a sketch:
#include <stdint.h>
uint64_t mul(uint64_t a, uint64_t b) {
uint32_t ah = a >> 32;
uint32_t al = a; // truncates: now a = al + 2**32 * ah
uint32_t bh = b >> 32;
uint32_t bl = b; // truncates: now b = bl + 2**32 * bh
// a * b = 2**64 * ah * bh + 2**32 * (ah * bl + bh * al) + al * bl
uint64_t partial = (uint64_t) al * (uint64_t) bl;
uint64_t mid1 = (uint64_t) ah * (uint64_t) bl;
uint64_t mid2 = (uint64_t) al * (uint64_t) bh;
uint64_t carry = (uint64_t) ah * (uint64_t) bh;
// add high parts of mid1 and mid2 to carry
// add low parts of mid1 and mid2 to partial, carrying
// any carry bits into carry...
}
The problem is not just the partial products but the fact that any of the sums can overflow.
If I had to do this for real, I would write an extended-multiply routine in the local assembly language. That is, for example, multiply two 64-bit integers to get a 128-bit result, which is stored in two 64-bit registers. All reasonable hardware provides this functionality in a single native multiply instruction—it's not just accessible from C.
This is one of those rare cases where the solution that's most elegant and easy to program is actually to use assembly language. But it's certainly not portable :-(

The GNU Portability Library (Gnulib) contains a module intprops, which has macros that efficiently test whether arithmetic operations would overflow.
For example, if an overflow in multiplication would occur, INT_MULTIPLY_OVERFLOW (a, b) would yield 1.

Perhaps the best way to solve this problem is to have a function, which multiplies two UInt64 and results a pair of UInt64, an upper part and a lower part of the UInt128 result. Here is the solution, including a function, which displays the result in hex. I guess you perhaps prefer a C++ solution, but I have a working Swift-Solution which shows, how to manage the problem:
func hex128 (_ hi: UInt64, _ lo: UInt64) -> String
{
var s: String = String(format: "%08X", hi >> 32)
+ String(format: "%08X", hi & 0xFFFFFFFF)
+ String(format: "%08X", lo >> 32)
+ String(format: "%08X", lo & 0xFFFFFFFF)
return (s)
}
func mul64to128 (_ multiplier: UInt64, _ multiplicand : UInt64)
-> (result_hi: UInt64, result_lo: UInt64)
{
let x: UInt64 = multiplier
let x_lo: UInt64 = (x & 0xffffffff)
let x_hi: UInt64 = x >> 32
let y: UInt64 = multiplicand
let y_lo: UInt64 = (y & 0xffffffff)
let y_hi: UInt64 = y >> 32
let mul_lo: UInt64 = (x_lo * y_lo)
let mul_hi: UInt64 = (x_hi * y_lo) + (mul_lo >> 32)
let mul_carry: UInt64 = (x_lo * y_hi) + (mul_hi & 0xffffffff)
let result_hi: UInt64 = (x_hi * y_hi) + (mul_hi >> 32) + (mul_carry >> 32)
let result_lo: UInt64 = (mul_carry << 32) + (mul_lo & 0xffffffff)
return (result_hi, result_lo)
}
Here is an example to verify, that the function works:
var c: UInt64 = 0
var d: UInt64 = 0
(c, d) = mul64to128(0x1234567890123456, 0x9876543210987654)
// 0AD77D742CE3C72E45FD10D81D28D038 is the result of the above example
print(hex128(c, d))
(c, d) = mul64to128(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF)
// FFFFFFFFFFFFFFFE0000000000000001 is the result of the above example
print(hex128(c, d))

There is a simple (and often very fast solution) which has not been mentioned yet. The solution is based on the fact that n-Bit times m-Bit multiplication does never overflow for a product width of n+m-bit or higher but overflows for all result widths smaller than n+m-1.
Because my old description might have been too difficult to read for some people, I try it again:
What you need is checking the sum of leading-zeroes of both operands. It would be very easy to prove mathematically.
Let x be n-Bit and y be m-Bit. z = x * y is k-Bit. Because the product can be n+m bit large at most it can overflow. Let's say. x*y is p-Bit long (without leading zeroes). The leading zeroes of the product are clz(x * y) = n+m - p. clz behaves similar to log, hence:
clz(x * y) = clz(x) + clz(y) + c with c = either 1 or 0.
(thank you for the c = 1 advice in the comment!)
It overflows when k < p <= n+m <=> n+m - k > n+m - p = clz(x * y).
Now we can use this algorithm:
if max(clz(x * y)) = clz(x) + clz(y) +1 < (n+m - k) --> overflow
if max(clz(x * y)) = clz(x) + clz(y) +1 == (n+m - k) --> overflow if c = 0
else --> no overflow
How to check for overflow in the middle case? I assume, you have a multiplication instruction. Then we easily can use it to see the leading zeroes of the result, i.e.:
if clz(x * y / 2) == (n+m - k) <=> msb(x * y/2) == 1 --> overflow
else --> no overflow
You do the multiplication by treating x/2 as fixed point and y as normal integer:
msb(x * y/2) = msb(floor(x * y / 2))
floor(x * y/2) = floor(x/2) * y + (lsb(x) * floor(y/2)) = (x >> 1)*y + (x & 1)*(y >> 1)
(this result never overflows in case of clz(x)+clz(y)+1 == (n+m -k))
The trick is using builtins/intrinsics. In GCC it looks this way:
static inline int clz(int a) {
if (a == 0) return 32; //only needed for x86 architecture
return __builtin_clz(a);
}
/**#fn static inline _Bool chk_mul_ov(uint32_t f1, uint32_t f2)
* #return one, if a 32-Bit-overflow occurs when unsigned-unsigned-multipliying f1 with f2 otherwise zero. */
static inline _Bool chk_mul_ov(uint32_t f1, uint32_t f2) {
int lzsum = clz(f1) + clz(f2); //leading zero sum
return
lzsum < sizeof(f1)*8-1 || ( //if too small, overflow guaranteed
lzsum == sizeof(f1)*8-1 && //if special case, do further check
(int32_t)((f1 >> 1)*f2 + (f1 & 1)*(f2 >> 1)) < 0 //check product rightshifted by one
);
}
...
if (chk_mul_ov(f1, f2)) {
//error handling
}
...
Just an example for n = m = k = 32-Bit unsigned-unsigned-multiplication. You can generalize it to signed-unsigned- or signed-signed-multiplication. And even no multiple-bit-shift is required (because some microcontrollers implement one-bit-shifts only but sometimes support product divided by two with a single instruction like Atmega!). However, if no count-leading-zeroes instruction exists but long multiplication, this might not be better.
Other compilers have their own way of specifying intrinsics for CLZ operations.
Compared to checking upper half of the multiplication the clz-method should scale better (in worst case) than using a highly optimized 128-Bit multiplication to check for 64-Bit overflow. Multiplication needs over linear overhead while count bits needs only linear overhead.
This code worked out-of-the box for me when tried.

I've been working with this problem this days and I have to say that it has impressed me the number of times I have seen people saying the best way to know if there has been an overflow is to divide the result, thats totally inefficient and unnecessary. The point for this function is that it must be as fast as possible.
There are two options for the overflow detection:
1º- If possible create the result variable twice as big as the multipliers, for example:
struct INT32struct {INT16 high, low;};
typedef union
{
struct INT32struct s;
INT32 ll;
} INT32union;
INT16 mulFunction(INT16 a, INT16 b)
{
INT32union result.ll = a * b; //32Bits result
if(result.s.high > 0)
Overflow();
return (result.s.low)
}
You will know inmediately if there has been an overflow, and the code is the fastest possible without writing it in machine code. Depending on the compiler this code can be improved in machine code.
2º- Is impossible to create a result variable twice as big as the multipliers variable:
Then you should play with if conditions to determine the best path. Continuing with the example:
INT32 mulFunction(INT32 a, INT32 b)
{
INT32union s_a.ll = abs(a);
INT32union s_b.ll = abs(b); //32Bits result
INT32union result;
if(s_a.s.hi > 0 && s_b.s.hi > 0)
{
Overflow();
}
else if (s_a.s.hi > 0)
{
INT32union res1.ll = s_a.s.hi * s_b.s.lo;
INT32union res2.ll = s_a.s.lo * s_b.s.lo;
if (res1.hi == 0)
{
result.s.lo = res1.s.lo + res2.s.hi;
if (result.s.hi == 0)
{
result.s.ll = result.s.lo << 16 + res2.s.lo;
if ((a.s.hi >> 15) ^ (b.s.hi >> 15) == 1)
{
result.s.ll = -result.s.ll;
}
return result.s.ll
}else
{
Overflow();
}
}else
{
Overflow();
}
}else if (s_b.s.hi > 0)
{
//Same code changing a with b
}else
{
return (s_a.lo * s_b.lo);
}
}
I hope this code helps you to have a quite efficient program and I hope the code is clear, if not I'll put some coments.
best regards.

Here is a trick for detecting whether multiplication of two unsigned integers overflows.
We make the observation that if we multiply an N-bit-wide binary number with an M-bit-wide binary number, the product does not have more than N + M bits.
For instance, if we are asked to multiply a three-bit number with a twenty-nine bit number, we know that this doesn't overflow thirty-two bits.
#include <stdlib.h>
#include <stdio.h>
int might_be_mul_oflow(unsigned long a, unsigned long b)
{
if (!a || !b)
return 0;
a = a | (a >> 1) | (a >> 2) | (a >> 4) | (a >> 8) | (a >> 16) | (a >> 32);
b = b | (b >> 1) | (b >> 2) | (b >> 4) | (b >> 8) | (b >> 16) | (b >> 32);
for (;;) {
unsigned long na = a << 1;
if (na <= a)
break;
a = na;
}
return (a & b) ? 1 : 0;
}
int main(int argc, char **argv)
{
unsigned long a, b;
char *endptr;
if (argc < 3) {
printf("supply two unsigned long integers in C form\n");
return EXIT_FAILURE;
}
a = strtoul(argv[1], &endptr, 0);
if (*endptr != 0) {
printf("%s is garbage\n", argv[1]);
return EXIT_FAILURE;
}
b = strtoul(argv[2], &endptr, 0);
if (*endptr != 0) {
printf("%s is garbage\n", argv[2]);
return EXIT_FAILURE;
}
if (might_be_mul_oflow(a, b))
printf("might be multiplication overflow\n");
{
unsigned long c = a * b;
printf("%lu * %lu = %lu\n", a, b, c);
if (a != 0 && c / a != b)
printf("confirmed multiplication overflow\n");
}
return 0;
}
A smattering of tests: (on 64 bit system):
$ ./uflow 0x3 0x3FFFFFFFFFFFFFFF
3 * 4611686018427387903 = 13835058055282163709
$ ./uflow 0x7 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
7 * 4611686018427387903 = 13835058055282163705
confirmed multiplication overflow
$ ./uflow 0x4 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
4 * 4611686018427387903 = 18446744073709551612
$ ./uflow 0x5 0x3FFFFFFFFFFFFFFF
might be multiplication overflow
5 * 4611686018427387903 = 4611686018427387899
confirmed multiplication overflow
The steps in might_be_mul_oflow are almost certainly slower than just doing the division test, at least on mainstream processors used in desktop workstations, servers and mobile devices. On chips without good division support, it could be useful.
It occurs to me that there is another way to do this early rejection test.
We start with a pair of numbers arng and brng which are initialized to 0x7FFF...FFFF and 1.
If a <= arng and b <= brng we can conclude that there is no overflow.
Otherwise, we shift arng to the right, and shift brng to the left, adding one bit to brng, so that they are 0x3FFF...FFFF and 3.
If arng is zero, finish; otherwise repeat at 2.
The function now looks like:
int might_be_mul_oflow(unsigned long a, unsigned long b)
{
if (!a || !b)
return 0;
{
unsigned long arng = ULONG_MAX >> 1;
unsigned long brng = 1;
while (arng != 0) {
if (a <= arng && b <= brng)
return 0;
arng >>= 1;
brng <<= 1;
brng |= 1;
}
return 1;
}
}

When your using e.g. 64 bits variables, implement 'number of significant bits' with nsb(var) = { 64 - clz(var); }.
clz(var) = count leading zeros in var, a builtin command for GCC and Clang, or probably available with inline assembly for your CPU.
Now use the fact that nsb(a * b) <= nsb(a) + nsb(b) to check for overflow. When smaller, it is always 1 smaller.
Ref GCC: Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

I was thinking about this today and stumbled upon this question, my thoughts led me to this result. TLDR, while I find it "elegant" in that it only uses a few lines of code (could easily be a one liner), and has some mild math that simplifies to something relatively simple conceptually, this is mostly "interesting" and I haven't tested it.
If you think of an unsigned integer as being a single digit with radix 2^n where n is the number of bits in the integer, then you can map those numbers to radians around the unit circle, e.g.
radians(x) = x * (2 * pi * rad / 2^n)
When the integer overflows, it is equivalent to wrapping around the circle. So calculating the carry is equivalent to calculating the number of times multiplication would wrap around the circle. To calculate the number of times we wrap around the circle we divide radians(x) by 2pi radians. e.g.
wrap(x) = radians(x) / (2*pi*rad)
= (x * (2*pi*rad / 2^n)) / (2*pi*rad / 1)
= (x * (2*pi*rad / 2^n)) * (1 / 2*pi*rad)
= x * 1 / 2^n
= x / 2^n
Which simplifies to
wrap(x) = x / 2^n
This makes sense. The number of times a number, for example, 15 with radix 10, wraps around is 15 / 10 = 1.5, or one and a half times. However, we can't use 2 digits here (assuming we are limited to a single 2^64 digit).
Say we have a * b, with radix R, we can calculate the carry with
Consider that: wrap(a * b) = a * wrap(b)
wrap(a * b) = (a * b) / R
a * wrap(b) = a * (b / R)
a * (b / R) = (a * b) / R
carry = floor(a * wrap(b))
Take for example a = 9 and b = 5, which are factors of 45 (i.e. 9 * 5 = 45).
wrap(5) = 5 / 10 = 0.5
a * wrap(5) = 9 * 0.5 = 4.5
carry = floor(9 * wrap(5)) = floor(4.5) = 4
Note that if the carry was 0, then we would not have had overflow, for example if a = 2, b=2.
In C/C++ (if the compiler and architecture supports it) we have to use long double.
Thus we have:
long double wrap = b / 18446744073709551616.0L; // this is b / 2^64
unsigned long carry = (unsigned long)(a * wrap); // floor(a * wrap(b))
bool overflow = carry > 0;
unsigned long c = a * b;
c here is the lower significant "digit", i.e. in base 10 9 * 9 = 81, carry = 8, and c = 1.
This was interesting to me in theory, so I thought I'd share it, but one major caveat is with the floating point precision in computers. Using long double, there may be rounding errors for some numbers when we calculate the wrap variable depending on how many significant digits your compiler/arch uses for long doubles, I believe it should be 20 more more to be sure. Another issue with this result, is that it may not perform as well as some of the other solutions simply by using floating points and division.

If you just want to detect overflow, how about converting to double, doing the multiplication and if
|x| < 2^53, convert to int64
|x| < 2^63, make the multiplication using int64
otherwise produce whatever error you want?
This seems to work:
int64_t safemult(int64_t a, int64_t b) {
double dx;
dx = (double)a * (double)b;
if ( fabs(dx) < (double)9007199254740992 )
return (int64_t)dx;
if ( (double)INT64_MAX < fabs(dx) )
return INT64_MAX;
return a*b;
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Modulo multiplication (in C) - c

Related

Abort on multiplication overflow [duplicate]

Fast multiplication modulo 2^16 + 1

How can I implement a modulo operation on unsigned ints with limited hardware in C

Modulo algorithm with byte array and 8 bit integer: 8bit = bytes % 8bit

Catch and compute overflow during multiplication of two large integers

Categories

Resources