In-place integer multiplication

I'm writing a program (in C) in which I try to calculate powers of big numbers in as short a time as possible. I represent the numbers as vectors of digits, so all operations have to be written by hand.
The program would be much faster without all the allocations and deallocations of intermediary results. Is there any algorithm for doing integer multiplication in place? For example, the function
void BigInt_Times(BigInt *a, const BigInt *b);
would place the result of the multiplication of a and b inside of a, without using an intermediary value.

Here, muln() is 2n (really, n) by n = 2n in-place multiplication for unsigned integers. You can adjust it to operate with 32-bit or 64-bit "digits" instead of 8-bit. The modulo operator is left in for clarity.
muln2() is n by n = n in-place multiplication (as hinted here), also operating on 8-bit "digits".
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
typedef unsigned char uint8;
typedef unsigned short uint16;
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned uint32;
#else
typedef unsigned long uint32;
#endif
typedef unsigned uint;
void muln(uint8* dst/* n bytes + n extra bytes for product */,
const uint8* src/* n bytes */,
uint n)
{
uint c1, c2;
memset(dst + n, 0, n);
for (c1 = 0; c1 < n; c1++)
{
uint8 carry = 0;
for (c2 = 0; c2 < n; c2++)
{
uint16 p = dst[c1] * src[c2] + carry + dst[(c1 + n + c2) % (2 * n)];
dst[(c1 + n + c2) % (2 * n)] = (uint8)(p & 0xFF);
carry = (uint8)(p >> 8);
}
dst[c1] = carry;
}
for (c1 = 0; c1 < n; c1++)
{
uint8 t = dst[c1];
dst[c1] = dst[n + c1];
dst[n + c1] = t;
}
}
void muln2(uint8* dst/* n bytes */,
const uint8* src/* n bytes */,
uint n)
{
uint c1, c2;
if (n >= 0xFFFF) abort();
for (c1 = n - 1; c1 != ~0u; c1--)
{
uint16 s = 0;
uint32 p = 0; // p must be able to store ceil(log2(n))+2*8 bits
for (c2 = c1; c2 != ~0u; c2--)
{
p += dst[c2] * src[c1 - c2];
}
dst[c1] = (uint8)(p & 0xFF);
for (c2 = c1 + 1; c2 < n; c2++)
{
p >>= 8;
s += dst[c2] + (uint8)(p & 0xFF);
dst[c2] = (uint8)(s & 0xFF);
s >>= 8;
}
}
}
int main(void)
{
uint8 a[4] = { 0xFF, 0xFF, 0x00, 0x00 };
uint8 b[2] = { 0xFF, 0xFF };
printf("0x%02X%02X * 0x%02X%02X = ", a[1], a[0], b[1], b[0]);
muln(a, b, 2);
printf("0x%02X%02X%02X%02X\n", a[3], a[2], a[1], a[0]);
a[0] = -2; a[1] = -1;
b[0] = -3; b[1] = -1;
printf("0x%02X%02X * 0x%02X%02X = ", a[1], a[0], b[1], b[0]);
muln2(a, b, 2);
printf("0x%02X%02X\n", a[1], a[0]);
return 0;
}
Output:
0xFFFF * 0xFFFF = 0xFFFE0001
0xFFFE * 0xFFFD = 0x0006
I think this is the best we can do in-place. One thing I don't like about muln2() is that it has to accumulate bigger intermediate products and then propagate a bigger carry.

Well, the standard algorithm consists of multiplying every digit (word) of 'a' with every digit of 'b' and summing them into the appropriate places in the result. The i'th digit of a thus goes into every digit from i to i+n of the result. So in order to do this 'in place' you need to calculate the output digits down from most significant to least. This is a little bit trickier than doing it from least to most, but not much...
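A minimal sketch of that most-significant-first idea, under assumptions that are not in the original: digits are 32-bit words stored least significant first, the result is truncated to n digits (as muln2() above does), and b must not overlap a. The name mul_inplace_lo is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a = (a * b) mod 2^(32*n), computed in place.
 * Digits are little-endian 32-bit words; b must not overlap a. */
void mul_inplace_lo(uint32_t *a, const uint32_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    /* Walk a's digits from most to least significant. When digit i is
     * processed, a[i+1..n-1] already hold partial sums from higher digits,
     * while a[0..i-1] still hold the untouched original digits. */
    for (size_t i = n; i-- > 0; ) {
        uint64_t ai = a[i];
        uint64_t carry = 0;
        a[i] = 0;
        for (size_t j = 0; i + j < n; j++) {
            /* ai * b[j] + a[i+j] + carry <= 2^64 - 1, so it fits in 64 bits */
            uint64_t acc = ai * b[j] + a[i + j] + carry;
            a[i + j] = (uint32_t)acc;
            carry = acc >> 32;
        }
        /* the carry out of digit n-1 is dropped: the product is truncated */
    }
}
```

Because each original digit is read exactly once, just before its slot is reused for output, no temporary array is needed. A full 2n-digit product would need the extra n digits of space that muln() uses.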

It doesn't sound like you really need an algorithm. Rather, you need better use of the language's features.
Why not just create that function you indicated in your question? Use it and enjoy! (The function would likely end up returning a reference to a as its result.)

Typically, big-int representations vary in length depending on the value represented; in general, the result is going to be longer than either operand. In particular, for multiplication, the size of the resulting representation is roughly the sum of the sizes of the arguments.
If you are certain that memory management is truly the bottleneck for your particular platform, you might consider implementing a multiply function which updates a third value. In terms of your C-style function prototype above:
void BigInt_Times_Update(const BigInt* a, const BigInt* b, BigInt* target);
That way, you can handle memory management in the same way C++ std::vector<> containers do: your update target only needs to reallocate its heap data when the existing size is too small.
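As a rough sketch of that idea: the BigInt layout below (little-endian 32-bit digits with len and cap fields) and the helper BigInt_Reserve are assumptions made for illustration, and the function returns an int status instead of void so that allocation failure can be reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: little-endian 32-bit digits plus a capacity field. */
typedef struct {
    uint32_t *digits;
    size_t len;   /* digits in use */
    size_t cap;   /* digits allocated */
} BigInt;

/* Grow the buffer only when it is too small. Returns 0 on success. */
static int BigInt_Reserve(BigInt *x, size_t need)
{
    if (need <= x->cap) return 0;
    uint32_t *p = realloc(x->digits, need * sizeof *p);
    if (p == NULL) return -1;
    x->digits = p;
    x->cap = need;
    return 0;
}

/* target = a * b; target must not alias a or b. Returns 0 on success. */
int BigInt_Times_Update(const BigInt *a, const BigInt *b, BigInt *target)
{
    assert(target != a && target != b);
    size_t n = a->len + b->len;
    if (n == 0) { target->len = 0; return 0; }
    if (BigInt_Reserve(target, n) != 0) return -1;
    memset(target->digits, 0, n * sizeof *target->digits);
    for (size_t i = 0; i < a->len; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < b->len; j++) {
            uint64_t acc = (uint64_t)a->digits[i] * b->digits[j]
                         + target->digits[i + j] + carry;
            target->digits[i + j] = (uint32_t)acc;
            carry = acc >> 32;
        }
        target->digits[i + b->len] = (uint32_t)carry;
    }
    while (n > 0 && target->digits[n - 1] == 0) n--;  /* trim leading zeros */
    target->len = n;
    return 0;
}
```

When the same target is reused across a chain of multiplications, the buffer is reallocated only while the results keep growing past its capacity; smaller results reuse the existing storage.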

Related

How to do 1024-bit operations using arrays of uint64_t

I am trying to find a way to compute values that are of type uint1024_t (unsigned 1024-bit integer), by defining the 5 basic operations: plus, minus, times, divide, modulus.
The way that I can do that is by creating a structure that will have the following prototype:
typedef struct {
uint64_t chunk[16];
} uint1024_t;
Now since it is complicated to wrap my head around such operations with uint64_t as block size, I have first written some code for manipulating uint8_t. Here is what I came up with:
#define UINT8_HI(x) (x >> 4)
#define UINT8_LO(x) (((1 << 4) - 1) & x)
void uint8_add(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint8_t s0, s1, s2;
uint8_t x = UINT8_LO(a) + UINT8_LO(b);
s0 = UINT8_LO(x);
x = UINT8_HI(a) + UINT8_HI(b) + UINT8_HI(x);
s1 = UINT8_LO(x);
s2 = UINT8_HI(x);
uint8_t result = s0 + (s1 << 4);
uint8_t carry = s2;
res[1 + i] = result;
res[0 + i] = carry;
}
void uint8_multiply(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint8_t s0, s1, s2, s3;
uint8_t x = UINT8_LO(a) * UINT8_LO(b);
s0 = UINT8_LO(x);
x = UINT8_HI(a) * UINT8_LO(b) + UINT8_HI(x);
s1 = UINT8_LO(x);
s2 = UINT8_HI(x);
x = s1 + UINT8_LO(a) * UINT8_HI(b);
s1 = UINT8_LO(x);
x = s2 + UINT8_HI(a) * UINT8_HI(b) + UINT8_HI(x);
s2 = UINT8_LO(x);
s3 = UINT8_HI(x);
uint8_t result = s1 << 4 | s0;
uint8_t carry = s3 << 4 | s2;
res[1 + i] = result;
res[0 + i] = carry;
}
And it seems to work just fine, however I am unable to define the same operations for division, subtraction and modulus...
Furthermore, I just can't see how to apply the same principle to my custom uint1024_t structure, even though it should be nearly identical, with a few more lines of code to manage overflows.
I would really appreciate some help in implementing the 5 basic operations for my structure.
EDIT:
I have answered below with my implementation for resolving this problem.

find a way to compute ... the 5 basic operations: plus, minus, times, divide, modulus.
If uint1024_t used uint32_t, it would be easier.
I would recommend 1) half the width of the widest type uintmax_t, or 2) unsigned, whichever is smaller. E.g. 32-bit.
(Also consider something other than uintN_t to avoid collisions with future versions of C.)
typedef struct {
uint32_t chunk[1024/32];
} u1024;
Example of some untested code to give OP an idea of how using uint32_t simplifies the task.
void u1024_mult(u1024 *product, const u1024 *a, const u1024 *b) {
memset(product, 0, sizeof product[0]);
unsigned n = sizeof product->chunk / sizeof product->chunk[0];
for (unsigned ai = 0; ai < n; ai++) {
uint64_t acc = 0;
uint32_t m = a->chunk[ai];
for (unsigned bi = 0; ai + bi < n; bi++) {
acc += (uint64_t) m * b->chunk[bi] + product->chunk[ai + bi];
product->chunk[ai + bi] = (uint32_t) acc;
acc >>= 32;
}
}
}
+, - are quite similar to the above.
/, % could be combined into one routine that computes the quotient and remainder together.
It is not that hard to post those functions here, as it really is the same as grade-school math, but in base 2^32 instead of base 10. I am against posting it, though, as it is a fun exercise to do oneself.
I hope the * sample code above inspires rather than answers.
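In the same spirit, here is an untested-in-production sketch of subtraction on the u1024 layout above. The function name u1024_sub and the choice to return the final borrow are assumptions for illustration, not part of the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t chunk[1024 / 32];   /* least significant chunk first */
} u1024;

/* diff = (a - b) mod 2^1024. Returns 1 if a < b (a borrow came out of the
 * top chunk), 0 otherwise. diff may alias a or b. */
unsigned u1024_sub(u1024 *diff, const u1024 *a, const u1024 *b)
{
    assert(diff != NULL && a != NULL && b != NULL);
    uint64_t borrow = 0;
    size_t n = sizeof diff->chunk / sizeof diff->chunk[0];
    for (size_t i = 0; i < n; i++) {
        /* Computed in 64 bits: if the true difference is negative, the
         * value wraps and the upper 32 bits become all ones. */
        uint64_t t = (uint64_t)a->chunk[i] - b->chunk[i] - borrow;
        diff->chunk[i] = (uint32_t)t;
        borrow = (t >> 32) & 1u;
    }
    return (unsigned)borrow;
}
```

The returned borrow doubles as a comparison: a nonzero result means a was smaller than b, which is the test a long-division routine needs at each step.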

There are some problems with your implementation for uint8_t arrays:
you did not parenthesize the macro arguments in the expansion. This is very error prone as it may cause unexpected operator precedence problems if the arguments are expressions. You should write:
#define UINT8_HI(x) ((x) >> 4)
#define UINT8_LO(x) (((1 << 4) - 1) & (x))
storing the array elements with the most significant part first is counterintuitive. Multi-precision arithmetic usually represents large values as arrays with the least significant part first.
for a small type such as uint8_t, there is no need to split it into halves as larger types are available. Furthermore, you must propagate the carry from the previous addition. Here is a much simpler implementation for the addition:
void uint8_add(uint8_t a, uint8_t b, uint8_t *res, int i) {
uint16_t result = a + b + res[i + 0]; // add previous carry
res[i + 0] = (uint8_t)result;
res[i + 1] = (uint8_t)(result >> 8); // assuming res has at least i+2 elements and is initialized to 0
}
for the multiplication, you must add the result of multiplying each part of each number to the appropriately chosen parts of the result number, propagating the carry to the higher parts.
Division is more difficult to implement efficiently. I recommend you study an open source multi-precision package such as QuickJS' libbf.c.
To transpose this to arrays of uint64_t, you can use unsigned 128-bit integer types if available on your platform (gcc and clang support such types on 64-bit targets; MSVC has no native 128-bit integer type and offers intrinsics such as _umul128 instead).
Here is a simple implementation for the addition and multiplication:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#define NB_CHUNK 16
typedef __uint128_t uint128_t;
typedef struct {
uint64_t chunk[NB_CHUNK];
} uint1024_t;
void uint1024_add(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
uint128_t result = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
result += (uint128_t)a->chunk[i] + b->chunk[i];
dest->chunk[i] = (uint64_t)result;
result >>= CHAR_BIT * sizeof(uint64_t);
}
}
void uint1024_multiply(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
for (size_t i = 0; i < NB_CHUNK; i++)
dest->chunk[i] = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
uint128_t result = 0;
for (size_t j = 0, k = i; k < NB_CHUNK; j++, k++) {
result += (uint128_t)a->chunk[i] * b->chunk[j] + dest->chunk[k];
dest->chunk[k] = (uint64_t)result;
result >>= CHAR_BIT * sizeof(uint64_t);
}
}
}
If 128-bit integers are not available, your 1024-bit type could be implemented as an array of 32-bit integers. Here is a flexible implementation with selectable types for the array elements and the intermediary result:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#if 1 // if platform has 128 bit integers
typedef uint64_t type1;
typedef __uint128_t type2;
#else
typedef uint32_t type1;
typedef uint64_t type2;
#endif
#define TYPE1_BITS (CHAR_BIT * sizeof(type1))
#define NB_CHUNK (1024 / TYPE1_BITS)
typedef struct uint1024_t {
type1 chunk[NB_CHUNK];
} uint1024_t;
void uint1024_add(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
type2 result = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
result += (type2)a->chunk[i] + b->chunk[i];
dest->chunk[i] = (type1)result;
result >>= TYPE1_BITS;
}
}
void uint1024_multiply(uint1024_t *dest, const uint1024_t *a, const uint1024_t *b) {
for (size_t i = 0; i < NB_CHUNK; i++)
dest->chunk[i] = 0;
for (size_t i = 0; i < NB_CHUNK; i++) {
type2 result = 0;
for (size_t j = 0, k = i; k < NB_CHUNK; j++, k++) {
result += (type2)a->chunk[i] * b->chunk[j] + dest->chunk[k];
dest->chunk[k] = (type1)result;
result >>= TYPE1_BITS;
}
}
}
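For division and modulus, one simple (and slow) option is binary long division: shift the dividend into a remainder one bit at a time and subtract the divisor whenever it fits. The sketch below is only an illustration under assumptions: it uses its own hypothetical type big1024 with 32-bit words so that it is self-contained, and it makes no attempt at the efficiency of a real multi-precision library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NB_WORDS 32                                /* 32 x 32 bits = 1024 bits */
typedef struct { uint32_t w[NB_WORDS]; } big1024;  /* hypothetical type, LSW first */

/* Returns nonzero if a >= b. */
static int big_ge(const big1024 *a, const big1024 *b)
{
    for (size_t i = NB_WORDS; i-- > 0; )
        if (a->w[i] != b->w[i]) return a->w[i] > b->w[i];
    return 1;
}

/* a -= b (mod 2^1024). */
static void big_sub(big1024 *a, const big1024 *b)
{
    uint64_t borrow = 0;
    for (size_t i = 0; i < NB_WORDS; i++) {
        uint64_t t = (uint64_t)a->w[i] - b->w[i] - borrow;
        a->w[i] = (uint32_t)t;
        borrow = (t >> 32) & 1u;
    }
}

/* q = n / d and r = n % d. q and r must not alias n, d or each other.
 * Returns 0 on success, -1 if d is zero. */
int big_divmod(big1024 *q, big1024 *r, const big1024 *n, const big1024 *d)
{
    assert(q != n && q != d && r != n && r != d && q != r);
    uint32_t any = 0;
    for (size_t i = 0; i < NB_WORDS; i++) any |= d->w[i];
    if (any == 0) return -1;
    memset(q, 0, sizeof *q);
    memset(r, 0, sizeof *r);
    for (size_t bit = 32 * NB_WORDS; bit-- > 0; ) {
        /* r = (r << 1) | (bit of n); carry ends up holding the bit shifted out */
        uint32_t carry = (n->w[bit / 32] >> (bit % 32)) & 1u;
        for (size_t i = 0; i < NB_WORDS; i++) {
            uint32_t top = r->w[i] >> 31;
            r->w[i] = (r->w[i] << 1) | carry;
            carry = top;
        }
        /* a shifted-out bit means the true remainder exceeds 2^1024 > d */
        if (carry || big_ge(r, d)) {
            big_sub(r, d);
            q->w[bit / 32] |= (uint32_t)1 << (bit % 32);
        }
    }
    return 0;
}
```

This runs one shift, one compare and at most one subtract per bit of the dividend, so it is far slower than word-based long division, but it needs nothing beyond compare, shift and subtract.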

How to multiply 2 uint8 modulo a big number without using integer type in C language [closed]

If A and B are of type uint8_t and I want the result C = A*B % N, where N is 2^16, how do I do this in C if I can't use integers (so I can't declare N as an integer, only as uint8_t)?
N.B: A, B and C are stored in uint8 arrays, so they are "expressed" as uint8 but their values can be bigger.

In general there is no easy way to do this.
Firstly you need to implement the multiply with carry between A and B for each uint8_t block. See the answer here.
Dividing by 2^16 really means discarding the last 16 bits, i.e. not using the last two uint8_t elements of the array. The modulus operator does just the opposite: you only need to keep the last two uint8_t elements.
Take the lowest two uint8 of A (say a0 and a1) and B (say b0 and b1):
split each uint8 in high and low part
a0h = a0 >> 4; /* the same as a0h = a0 / 16; */
a0l = a0 % 16; /* the same as a0l = a0 & 0x0f; */
a1h = a1 >> 4;
a1l = a1 % 16;
b0h = b0 >> 4;
b0l = b0 % 16;
b1h = b1 >> 4;
b1l = b1 % 16;
Multiply the lower parts first (x is a buffer var)
x = a0l * b0l;
The first part of the result is the last four bits of x, let's call it s0l
s0l = x % 16;
The top four bits of x are the carry.
c = x>>4;
multiply the higher parts of first uint8 and add carry.
x = (a0h * b0h) + c;
The first part of the result is the last four bits of x, let's call it s0h. And we need to get carry again.
s0h = x % 16;
c = x>>4;
We can now combine the s0:
s0 = (s0h << 4) + s0l;
Do exactly the same for the s1 (but don't forget to add the carry!):
x = (a1l * b1l) + c;
s1l = x % 16;
c = x>>4;
x = (a1h * b1h) + c;
s1h = x % 16;
c = x>>4;
s1 = (s1h << 4) + s1l;
Your result at this point is c, s1 and s0 (you need the carry for the next multiplications, e.g. s2, s3, s4). As your formula says % (2^16), you already have your result: s1 and s0. If you have to reduce modulo something else, you should do something similar to the code above, but for division. In that case, be careful to catch division by zero, which is undefined behaviour for integers in C.
You can put A, B, C and S in array and loop it through the indexes to make code cleaner.

Here's my effort. I took the liberty of using larger integers and pointers for looping through the arrays. The numbers are represented by arrays of uint8_t in big-endian order. All the intermediate results are kept in uint8_t variables. The code could be made more efficient if intermediate results could be stored in wider integer variables!
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
static void add_c(uint8_t *r, size_t len_r, uint8_t x)
{
uint8_t o;
while (len_r--) {
o = r[len_r];
r[len_r] += x;
if (o <= r[len_r])
break;
x = 1;
}
}
void multiply(uint8_t *res, size_t len_res,
const uint8_t *a, size_t len_a, const uint8_t *b, size_t len_b)
{
size_t ia, ib, ir;
for (ir = 0; ir < len_res; ir++)
res[ir] = 0;
for (ia = 0; ia < len_a && ia < len_res; ia++) {
uint8_t ah, al, t;
t = a[len_a - ia - 1];
ah = t >> 4;
al = t & 0xf;
for (ib = 0; ib < len_b && ia + ib < len_res; ib++) {
uint8_t bh, bl, x, o, c0, c1;
t = b[len_b - ib - 1];
bh = t >> 4;
bl = t & 0xf;
c0 = al * bl;
c1 = ah * bh;
o = c0;
t = al * bh;
x = (t & 0xf) << 4;
c0 += x;
x = (t >> 4);
c1 += x;
if (o > c0)
c1++;
o = c0;
t = ah * bl;
x = (t & 0xf) << 4;
c0 += x;
x = (t >> 4);
c1 += x;
if (o > c0)
c1++;
add_c(res, len_res - ia - ib, c0);
add_c(res, len_res - ia - ib - 1, c1);
}
}
}
int main(void)
{
uint8_t a[2] = { 0xee, 0xdd };
uint8_t b[2] = { 0xcc, 0xbb };
uint8_t r[4];
multiply(r, sizeof(r), a, sizeof(a), b, sizeof(b));
printf("0x%02X%02X * 0x%02X%02X = 0x%02X%02X%02X%02X\n",
a[0], a[1], b[0], b[1], r[0], r[1], r[2], r[3]);
return 0;
}
Output:
0xEEDD * 0xCCBB = 0xBF06976F

unsigned to hex digit

I got a problem that says: Form a character array based on an unsigned int. Array will represent that int in hexadecimal notation. Do this using bitwise operators.
So, my idea is the following: I create a mask that has 1's for its 4 lowest value bits.
I push the bits of the given int by 4 to the right and use & on that int and mask. I repeat until (int != 0). My question is: when I get individual hex digits (packs of 4 bits), how do I convert them to a char? For example, I get:
x & mask = 1101(2) = 13(10) = D(16)
Is there a function to convert an int to hex representation, or do I have to use brute force with switch statement or whatever else?
I almost forgot, I am doing this in C :)
Here is what I mean:
#include <stdio.h>
#include <stdlib.h>
#define BLOCK 4
int main() {
unsigned int x, y, i, mask;
char a[4];
printf("Enter a positive number: ");
scanf("%u", &x);
for (i = sizeof(unsigned int), mask = ~(~0 << 4); x; i--, x >>= BLOCK) {
y = x & mask;
a[i] = FICTIVE_NUM_TO_HEX_DIGIT(y);
}
print_array(a);
return EXIT_SUCCESS;
}

You are almost there. The simplest method to convert an integer in the range from 0 to 15 to a hexadecimal digit is to use a lookup table,
char hex_digits[] = "0123456789ABCDEF";
and index into that,
a[i] = hex_digits[y];
in your code.
Remarks:
char a[4];
is probably too small. One hexadecimal digit corresponds to four bits, so with CHAR_BIT == 8, you need up to 2*sizeof(unsigned) chars to represent the number, generally, (CHAR_BIT * sizeof(unsigned int) + 3) / 4. Depending on what print_array does, you may need to 0-terminate a.
for (i = sizeof(unsigned int), mask = ~(~0 << 4); x; i--, x >>= BLOCK)
initialising i to sizeof(unsigned int) skips the most significant bits, i should be initialised to the last valid index into a (except for possibly the 0-terminator, then the penultimate valid index).
The mask can more simply be defined as mask = 0xF, that has the added benefit of not invoking undefined behaviour, which
mask = ~(~0 << 4)
probably does. 0 is an int, and thus ~0 is one too. On two's complement machines (that is almost everything nowadays), the value is -1, and shifting negative integers left is undefined behaviour.

char buffer[10] = {0};
int h = 17;
sprintf(buffer, "%02X", h);

Try something like this:
char hex_digits[] = "0123456789ABCDEF";
for (i = 0; i < ((sizeof(unsigned int) * CHAR_BIT + 3) / 4); i++) {
digit = (x >> (sizeof(unsigned int) * CHAR_BIT - 4)) & 0x0F;
x = x << 4;
a[i] = hex_digits[digit];
}

Ok, this is where I got:
#include <stdio.h>
#include <stdlib.h>
#define BLOCK 4
void printArray(char*, int);
int main() {
unsigned int x, mask;
int size = sizeof(unsigned int) * 2, i;
char a[size], hexDigits[] = "0123456789ABCDEF";
for (i = 0; i < size; i++)
a[i] = 0;
printf("Enter a positive number: ");
scanf("%u", &x);
for (i = size - 1, mask = ~(~0 << 4); x; i--, x >>= BLOCK) {
a[i] = hexDigits[x & mask];
}
printArray(a, size);
return EXIT_SUCCESS;
}
void printArray(char a[], int n) {
int i;
for (i = 0; i < n; i++)
printf("%c", a[i]);
putchar('\n');
}
I have compiled, it runs and it does the job correctly. I don't know... Should I be worried that this problem was a bit hard for me? At faculty, during exams, we must write our code by hand, on a piece of paper... I don't imagine I would have done this right.
Is there a better (less complicated) way to do this problem? Thank you all for help :)

I would consider the impact of potential padding bits when shifting, as shifting by anything equal to or greater than the number of value bits that exist in an integer type is undefined behaviour.
Perhaps you could terminate the string first using: array[--size] = '\0';, write the smallest nibble (hex digit) using array[--size] = "0123456789ABCDEF"[value & 0x0f], move onto the next nibble using: value >>= 4, and repeat while value > 0. When you're done, return array + size or &array[size] so that the caller knows where the hex sequence begins.
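The steps above can be sketched as follows. The function name to_hex and the HEX_BUF_SIZE macro are made up for this example, and the loop is a do-while (a small deviation from "repeat while value > 0") so that an input of 0 yields "0" instead of an empty string.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Room for every hex digit of an unsigned int plus the terminator. */
#define HEX_BUF_SIZE ((CHAR_BIT * sizeof(unsigned int) + 3) / 4 + 1)

/* Writes the hex digits of value at the END of buf and returns a pointer
 * to the first digit. buf must hold at least HEX_BUF_SIZE chars. */
char *to_hex(unsigned int value, char *buf, size_t size)
{
    assert(buf != NULL && size >= HEX_BUF_SIZE);
    buf[--size] = '\0';                       /* terminate the string first */
    do {
        buf[--size] = "0123456789ABCDEF"[value & 0x0Fu];  /* lowest nibble */
        value >>= 4;                          /* move on to the next nibble */
    } while (value > 0);
    return buf + size;                        /* where the digits begin */
}
```

A caller would declare `char buf[HEX_BUF_SIZE];` and print the returned pointer with `puts(to_hex(x, buf, sizeof buf));`. Filling the buffer backwards avoids having to reverse the digits afterwards.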

detecting multiplication of uint64_t integers overflow with C

Is there any efficient and portable way to check when multiplication operations with int64_t or uint64_t operands overflow in C?
For instance, for addition of uint64_t I can do:
if (UINT64_MAX - a < b) overflow_detected();
else sum = a + b;
But I can not get to a similar simple expression for multiplication.
All that occurs to me is breaking the operands into high and low uint32_t parts and performing the multiplication of those parts while checking for overflow, something really ugly and probably inefficient too.
Update 1: Some benchmark code implementing several approaches added
Update 2: Jens Gustedt method added
benchmarking program:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define N 100000000
int d = 2;
#define POW_2_64 ((double)(1u << 31) * (double)(1u << 31) * 4)
#define calc_b (a + c)
// #define calc_b (a + d)
int main(int argc, char *argv[]) {
uint64_t a;
uint64_t c = 0;
int o = 0;
int opt;
if (argc != 2) exit(1);
opt = atoi(argv[1]);
switch (opt) {
case 1: /* faked check, just for timing */
for (a = 0; a < N; a++) {
uint64_t b = a + c;
if (c > a) o++;
c += b * a;
}
break;
case 2: /* using division */
for (a = 0; a < N; a++) {
uint64_t b = a + c;
if (b && (a > UINT64_MAX / b)) o++;
c += b * a;
}
break;
case 3: /* using floating point, unreliable */
for (a = 0; a < N; a++) {
uint64_t b = a + c;
if ((double)UINT64_MAX < (double)a * (double)b) o++;
c += b * a;
}
break;
case 4: /* using floating point and division for difficult cases */
for (a = 0; a < N; a++) {
uint64_t b = a + c;
double m = (double)a * (double)b;
if ( ((double)(~(uint64_t)(0xffffffff)) < m ) &&
( (POW_2_64 < m) ||
( b &&
(a > UINT64_MAX / b) ) ) ) o++;
c += b * a;
}
break;
case 5: /* Jens Gustedt method */
for (a = 0; a < N; a++) {
uint64_t b = a + c;
uint64_t a1, b1;
if (a > b) { a1 = a; b1 = b; }
else { a1 = b; b1 = a; }
if (b1 > 0xffffffff) o++;
else {
uint64_t a1l = (a1 & 0xffffffff) * b1;
uint64_t a1h = (a1 >> 32) * b1 + (a1l >> 32);
if (a1h >> 32) o++;
}
c += b1 * a1;
}
break;
default:
exit(2);
}
printf("c: %llu, o: %d\n", (unsigned long long)c, o);
}
So far, case 4, which uses floating point to filter out most cases, is the fastest when overflows are assumed to be very unusual, at least on my computer, where it is only two times slower than the do-nothing case.
Case 5 is 30% slower than case 4, but it always performs the same: there are no special-case numbers that require slower processing, as happens with case 4.

Actually, the same principle can be used for multiplication:
uint64_t a;
uint64_t b;
...
if (b != 0 && a > UINT64_MAX / b) { // if you multiply by b, you get: a * b > UINT64_MAX
< error >
}
uint64_t c = a * b;
For signed integers something similar can be done; you'd probably need a case for each combination of signs.

If you want to avoid division as in Ambroz' answer:
First you have to see that the smaller of the two numbers, say a, is less than 2^32, otherwise the result will overflow anyhow. Let b be decomposed into two 32-bit words, that is b = c * 2^32 + d.
The computation then is not so difficult, I find:
uint64_t mult_with_overflow_check(uint64_t a, uint64_t b) {
if (a > b) return mult_with_overflow_check(b, a);
if (a > UINT32_MAX) overflow();
uint32_t c = b >> 32;
uint32_t d = UINT32_MAX & b;
uint64_t r = a * c;
uint64_t s = a * d;
if (r > UINT32_MAX) overflow();
r <<= 32;
return addition_with_overflow_check(s, r);
}
So these are two multiplications, two shifts, some additions and condition checks. This could be more efficient than the division because, e.g., the two multiplications can be pipelined in parallel. You'd have to benchmark to see what works better for you.

Related question with some (hopefully) useful answers: Best way to detect integer overflow in C/C++. Plus, it covers more than just uint64_t ;)

case 6:
for (a = 0; a < N; a++) {
uint64_t b = a + c;
uint64_t a1, b1;
if (a > b) { a1 = a; b1 = b; }
else { a1 = b; b1 = a; }
uint64_t cc = b1 * a1;
c += cc;
if (b1 > 0xffffffff) o++;
else {
uint64_t a1l = (a1 & 0xffffffff) + (a1 >> 32);
a1l = (a1 + (a1 >> 32)) & 0xffffffff;
uint64_t ab1l = a1l * b1;
ab1l = (ab1l & 0xffffffff) + (ab1l >> 32);
ab1l += (ab1l >> 32);
uint64_t ccl = (cc & 0xffffffff) + (cc >> 32);
ccl += (ccl >> 32);
uint32_t ab32 = ab1l; if (ab32 == 0xffffffff) ab32 = 0;
uint32_t cc32 = ccl; if (cc32 == 0xffffffff) cc32 = 0;
if (ab32 != cc32) o++;
}
}
break;
This method compares (possibly overflowing) result of normal multiplication with the result of multiplication, which cannot overflow. All calculations are modulo (2^32 - 1).
It is more complicated and (most likely) not faster than Jens Gustedt's method.
After some small modifications it may multiply with 96-bit precision (but without overflow control). What may be more interesting, the idea of this method may be used to check overflow for a series of arithmetic operations (multiplications, additions, subtractions).
Some questions answered
First of all, about "your code is not portable". Yes, code is not portable because it uses uint64_t, which is requested in the original question. Strictly speaking, you cannot get any portable answer with (u)int64_t because it is not required by standard.
About "once some overflow happens, you can not assume the result value to be anything". The standard says that unsigned integers cannot overflow. Chapter 6.2.5, item 9:
A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type.
So unsigned 64-bit multiplication is performed modulo 2^64, without overflow.
Now about the "logic behind". "Hash function" is not the correct word here. I only use calculations modulo (2^32 - 1). The result of multiplication may be represented as n*2^64 + m, where m is the visible result, and n means how much we overflow. Since 2^64 = 1 (mod 2^32 - 1), we may calculate [true value] - [visible value] = (n*2^64 + m) - m = n*2^64 = n (mod 2^32 - 1). If calculated value of n is not zero, there is an overflow. If it is zero, there is no overflow. Any collisions are possible only after n >= 2^32 - 1. This will never happen since we check that one of the multiplicands is less than 2^32.

It might not detect exact overflows, but in general you can test the result of your multiplication on a logarithmic scale:
if (log(UINT64_MAX-1) - log(a) - log(b) < 0) overflow_detected(); // subtracting 1 to allow some tolerance when the numbers are converted to double
else prod = a * b;
It depends on whether you really need to do multiplication up to exactly UINT64_MAX; otherwise this is a very portable and convenient way to check multiplications of large numbers.

Also consider using your compiler's built-in functions:
bool __builtin_mul_overflow (type1 a, type2 b, type3 *res)
this function/macro is defined in at least gcc and clang (I haven't checked others):
https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
https://clang.llvm.org/docs/LanguageExtensions.html
Clang provides a set of builtins that implement checked arithmetic for security critical applications in a manner that is fast and easily expressible in C
This answer here would have helped me a few weeks back, but I did finally find a great answer which goes into more detail about the builtins:
https://stackoverflow.com/a/20956705/7113685

Fastest way to calculate possible values of unsigned int with N unreliable bits?

Given an unsigned int A (32 bit), and another unsigned int B, where B's binary form denotes the 10 "least reliable" bits of A, what is the fastest way to expand all 1024 potential values of A? I'm looking to do this in C.
E.g. uint B is guaranteed to always have 10 1's and 22 0's in its binary form (10 least reliable bits).
For example, let's say
A = 2323409845
B = 1145324694
Their binary representations are:
a=10001010011111000110101110110101
b=01000100010001000100010010010110
B denotes the 10 least reliable bits of A. So each bit that is set to 1 in B denotes an unreliable bit in A.
I would like to calculate all 1024 possible values created by toggling any of those 10 bits in A.

No guarantees that this is certifiably "the fastest", but this is what I'd do. First, sieve out the fixed bits:
uint32_t const reliable_mask = ~B;
uint32_t const reliable_value = A & reliable_mask;
Now I'd preprocess an array of 1024 possible values of the unreliable bits:
uint32_t const unreliables[1024] = /* ... */
And finally I'd just OR all those together:
for (size_t i = 0; i != 1024; ++i)
{
uint32_t const val = reliable_value | unreliables[i];
}
To get the unreliable bits, you could just loop over [0, 1024) (maybe even inside the existing loop) and "spread" the bits out to the requisite positions.
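The "spreading" step could be sketched like this. The names spread_bits and expand_candidates are invented for the example, and the 24-bit limit is an arbitrary sanity bound of this sketch, not something from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Deposit the low bits of i into the positions of the set bits of mask:
 * bit 0 of i goes to the lowest set bit of mask, bit 1 to the next, etc. */
uint32_t spread_bits(uint32_t i, uint32_t mask)
{
    uint32_t out = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1u);  /* isolate lowest set bit */
        if (i & bit) out |= lowest;
        mask &= mask - 1u;                      /* clear that bit */
    }
    return out;
}

/* Fill out[] with every value of a obtained by varying the bits selected
 * by b. Returns the number of values written, or 0 if out is too small. */
size_t expand_candidates(uint32_t a, uint32_t b, uint32_t *out, size_t cap)
{
    assert(out != NULL);
    unsigned nbits = 0;
    for (uint32_t m = b; m != 0; m &= m - 1u) nbits++;   /* count set bits */
    if (nbits > 24 || ((size_t)1 << nbits) > cap) return 0;
    size_t count = (size_t)1 << nbits;
    uint32_t reliable_value = a & ~b;
    for (size_t i = 0; i < count; i++)
        out[i] = reliable_value | spread_bits((uint32_t)i, b);
    return count;
}
```

If B changes rarely, the spread_bits results can be computed once and kept as the unreliables table, leaving only a single OR per candidate in the hot loop.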

You can iterate through the 1024 different settings of the bits in b like so:
unsigned long b = 1145324694;
unsigned long c;
c = 0;
do {
printf("%#.8lx\n", c & b);
c = (c | ~b) + 1;
} while (c);
To use these to modify a you can just use XOR:
unsigned long a = 2323409845;
unsigned long b = 1145324694;
unsigned long c;
c = 0;
do {
printf("%#.8lx\n", a ^ (c & b));
c = (c | ~b) + 1;
} while (c);
This method has the advantages that you don't need to precalculate any tables, and you don't need to hardcode the 1024 - it will loop based entirely on the number of 1 bits in b.
It's also a relatively simple matter to parallelise this algorithm using integer vector instructions.

This follows essentially the technique used by Kerrek, but fleshes out the difficult parts:
int* getValues(int value, int unreliable_bits)
{
int unreliables[10];
int *values = malloc(1024 * sizeof(int));
int i = 0;
int mask;
The function definition and some variable declarations. Here, value is your A and unreliable_bits is your B.
value &= ~unreliable_bits;
Mask out the unreliable bits to ensure that ORing an integer containing some unreliable bits and value will yield what we want.
for(mask = 1;i < 10;mask <<= 1)
{
if(mask & unreliable_bits)
unreliables[i++] = mask;
}
Here, we get each unreliable bit into an individual int for use later.
for(i = 0;i < 1024;i++)
{
int some_unreliables = 0;
int j;
for(j = 0;j < 10;j++)
{
if(i & (1 << j))
some_unreliables |= unreliables[j];
}
values[i] = value | some_unreliables;
}
The meat of the function. The outer loop is over each of the outputs we want. Then, we use the lowest 10 bits of the loop variable i to determine whether to turn on each unreliable bit, using the fact that the integers 0 to 1023 go through all possibilities of the lowest 10 bits.
return values;
}
Finally, return the array we built. Here is a short main that can be used to test it with the values for A and B given in your question:
int main()
{
int *values = getValues(0x8A7C6BB5, 0x44444496);
int i;
for(i = 0;i < 1024;i++)
printf("%X\n", values[i]);
}