Reversing the bytes in a long in C - c

Okay, so I'm trying to write a function to reverse the bytes of a long (64 bits) in C, and I'm getting some weird results with my bit shifting.
long reverse_long(long x) {
int i;
for(i=0; i<4; i++) {
long a = 0x00000000000000FF<<(8*i);
long b = 0xFF00000000000000>>(8*i);
a = x&a;
b = x&b;
a=a<<8*(7-2*i);
b=b>>8*(7-2*i);
x=x&(~(0x00000000000000FF<<(8*i)));
x=x&(~(0xFF00000000000000>>(8*i)));
x=x|a;
x=x|b;
}
}
On line 4 (long a = 0x00000000000000FF<<(8*i)), I'm shifting a byte of ones to the left by 8 bits for each iteration of the loop, which works fine for the first, second, and third iterations, but on the fourth iteration I get something like 0xFFFFFFFF000000, when I should be getting 0x00000000FF000000.
Line 5 (long b = 0xFF00000000000000>>(8*i)) works just fine though, and gives me the value 0x000000FF00000000.
Can anyone tell me what's going on here?

To understand potential problems in your code you need to understand the following things:
The type and value of integer literals
Rules about left-shifting signed values
Rules about right-shifting signed values
Rules about ~ on signed values
Rules about shifting a value by the width of its type or more
The behaviour of out-of-range integral conversions
That's quite a lot of things to remember. To avoid having to deal with all sorts of weird issues (for example, long a = 0x00000000000000FF<<(8*i); causes undefined behaviour when i == 3), I would recommend the following:
Only use unsigned variables and constants (including x)
Use constants which are the correct width for the type
Further, your code makes the assumption that long is 64-bit. This is not always true. It would be better to do one of the following two things:
Have your code work for unsigned long, whatever size unsigned long might be
Use uint64_t instead of long
To cut a long story short, this is how your code should look if we just fix errors relating to the points I listed above (and do not change the algorithm):
uint64_t reverse_long(uint64_t x)
{
int i;
for(i=0; i<4; i++)
{
uint64_t a = 0xFFull << (8*i);
uint64_t b = 0xFF00000000000000ull >> (8*i);
a = x&a;
b = x&b;
a=a<<8*(7-2*i);
b=b>>8*(7-2*i);
x=x&(~(0xFFull<<(8*i)));
x=x&(~(0xFF00000000000000ull>>(8*i)));
x=x|a;
x=x|b;
}
return x; // don't forget this
}
note: I have used ull suffix to create 64-bit literals. Actually this only guarantees at least 64 bit, but since everything is unsigned here it makes no difference, excess bits will just get truncated. To be very precise, write (uint64_t)0xFF instead of 0xFFull, etc.
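As a small illustration of that last point, the byte mask can be built with an explicit cast so that the shift is always carried out in a 64-bit unsigned type. This is only a sketch; the helper name byte_mask is made up for the example and is not part of the code above.

```c
#include <stdint.h>

/* Hypothetical helper: returns a mask selecting byte i (0 = least
   significant) of a 64-bit value. The cast makes the left operand a
   uint64_t before the shift, so no signed int arithmetic is involved.
   Assumes 0 <= i <= 7. */
uint64_t byte_mask(unsigned i)
{
    return (uint64_t)0xFF << (8 * i);
}
```

With this helper, byte_mask(3) is 0x00000000FF000000, which is the value the question expected on the fourth iteration.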

You've received excellent advice on where your code went awry, but I thought you might like to see an alternate approach to reversing that might be a bit simpler.
uint64_t reverse_long(uint64_t n) {
uint8_t* a = (uint8_t*)&n;
uint8_t* b = a + 7;
while(a < b) {
uint8_t t = *b;
*b-- = *a;
*a++ = t;
}
return n; // the reversed value must be returned
}

a) Regarding your error:
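What you are doing there:
long a = 0x00000000000000FF<<(8*i);
Create a signed int constant 0xFF.
Shift it left by i bytes.
When it is shifted by 3 bytes, the constant becomes 0xFF000000 (strictly speaking this overflows a 32-bit signed int, which is undefined behaviour, but this is the bit pattern typically produced).
When that int is assigned to a signed long, sign extension is performed:
0xFF000000 -> 0xFFFFFFFFFF000000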
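The sign extension step can be reproduced without relying on the overflowing shift. The sketch below starts from a negative 32-bit value whose bit pattern is 0xFF000000 and widens it to 64 bits; the function name widen_demo is invented for this illustration.

```c
#include <stdint.h>

/* Hypothetical demo: widen a negative 32-bit value to 64 bits and
   return the resulting bit pattern. */
uint64_t widen_demo(void)
{
    int32_t narrow = -16777216;   /* -(2^24): bit pattern 0xFF000000 */
    int64_t wide = narrow;        /* value is preserved, so the upper 32 bits become ones */
    return (uint64_t)wide;        /* bit pattern 0xFFFFFFFFFF000000 */
}
```

An unsigned 32-bit value with the same bit pattern would instead widen to 0x00000000FF000000, which is why the other answers recommend unsigned types.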
b) Regarding your code:
There is an easier way to write your function, for example:
unsigned long reverse_long(unsigned long x) {
unsigned long rc = 0;
int i = 8;
do {
rc = (rc << 8) | (unsigned char)x;
x >>= 8;
} while(--i);
return rc;
}

The right shifting of signed longs is problematic when they're negative. This minor variant on your code, which is only safe on 64-bit machines where sizeof(long) == 8, ensures that the constants are long and the intermediate variables a and b are unsigned long to avoid those problems. The code contains quite a lot of diagnostics.
#include <stdio.h>
long reverse_long(long x);
long reverse_long(long x)
{
int i;
for (i = 0; i < 4; i++)
{
printf("x0 0x%.16lX\n", x);
unsigned long a = 0x00000000000000FFL << (8 * i);
unsigned long b = 0xFF00000000000000L >> (8 * i);
a &= x;
b &= x;
printf("a0 0x%.16lX; b0 0x%.16lX\n", a, b);
a <<= 8 * (7 - 2 * i);
b >>= 8 * (7 - 2 * i);
printf("a1 0x%.16lX; b1 0x%.16lX\n", a, b);
x &= (~(0x00000000000000FFL << (8 * i)));
x &= (~(0xFF00000000000000L >> (8 * i)));
printf("x1 0x%.16lX\n", x);
x |= a | b;
printf("x2 0x%.16lX\n", x);
}
return x;
}
int main(void)
{
long x = 0xFEDCBA9876543210L;
printf("0x%.16lX <=> 0x%.16lX\n", x, reverse_long(x));
return 0;
}
Output:
x0 0xFEDCBA9876543210
a0 0x0000000000000010; b0 0xFE00000000000000
a1 0x1000000000000000; b1 0x00000000000000FE
x1 0x00DCBA9876543200
x2 0x10DCBA98765432FE
x0 0x10DCBA98765432FE
a0 0x0000000000003200; b0 0x00DC000000000000
a1 0x0032000000000000; b1 0x000000000000DC00
x1 0x1000BA98765400FE
x2 0x1032BA987654DCFE
x0 0x1032BA987654DCFE
a0 0x0000000000540000; b0 0x0000BA0000000000
a1 0x0000540000000000; b1 0x0000000000BA0000
x1 0x103200987600DCFE
x2 0x1032549876BADCFE
x0 0x1032549876BADCFE
a0 0x0000000076000000; b0 0x0000009800000000
a1 0x0000007600000000; b1 0x0000000098000000
x1 0x1032540000BADCFE
x2 0x1032547698BADCFE
0xFEDCBA9876543210 <=> 0x1032547698BADCFE
Alternative Implementations
This is a variant of the program above, with the reverse_long() changed to reverse_uint64_v1() and using uint64_t instead of long and unsigned long. The printing is upgraded using PRIX64 format, but also commented out since it is being used in a performance test. The reverse_uint64_v2() function does fewer operations per cycle, though it does more cycles (8 instead of 4). It copies the low order byte of what's left of the input value into the low order byte of the current output value after it's been shifted left 8 places. The reverse_uint64_v3() function does a loop-unrolling of reverse_uint64_v2(), and micro-optimizes by avoiding one assignment to b and one extra shift at the end.
#include <inttypes.h>
#include <stdio.h>
#include "timer.h"
uint64_t reverse_uint64_v1(uint64_t x);
uint64_t reverse_uint64_v2(uint64_t x);
uint64_t reverse_uint64_v3(uint64_t x);
uint64_t reverse_uint64_v1(uint64_t x)
{
for (int i = 0; i < 4; i++)
{
//printf("x0 0x%.16" PRIX64 "\n", x);
uint64_t a = UINT64_C(0x00000000000000FF) << (8 * i);
uint64_t b = UINT64_C(0xFF00000000000000) >> (8 * i);
a &= x;
b &= x;
//printf("a0 0x%.16" PRIX64 "; b0 0x%.16" PRIX64 "\n", a, b);
a <<= 8 * (7 - 2 * i);
b >>= 8 * (7 - 2 * i);
//printf("a1 0x%.16" PRIX64 "; b1 0x%.16" PRIX64 "\n", a, b);
x &= ~(UINT64_C(0x00000000000000FF) << (8 * i));
x &= ~(UINT64_C(0xFF00000000000000) >> (8 * i));
//printf("x1 0x%.16" PRIX64 "\n", x);
x |= a | b;
//printf("x2 0x%.16" PRIX64 "\n", x);
}
return x;
}
uint64_t reverse_uint64_v2(uint64_t x)
{
uint64_t r = 0;
for (size_t i = 0; i < sizeof(uint64_t); i++)
{
uint64_t b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
}
return r;
}
uint64_t reverse_uint64_v3(uint64_t x)
{
uint64_t b;
uint64_t r;
r = x & 0xFF; // Optimization 1
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
x >>= 8;
b = x & 0xFF;
r = (r << 8) | b;
// x >>= 8; // Optimization 2
return r;
}
static void timing_test(uint64_t (*reverse)(uint64_t))
{
Clock clk;
clk_init(&clk);
uint64_t ur = 0;
uint64_t lb = UINT64_C(0x0123456789ABCDEF);
uint64_t ub = UINT64_C(0xFEDCBA9876543210);
uint64_t inc = UINT64_C(0x287654321);
uint64_t cnt = 0;
clk_start(&clk);
for (uint64_t u = lb; u < ub; u += inc)
{
ur += (*reverse)(u);
cnt++;
}
clk_stop(&clk);
char buffer[32];
printf("Sum = 0x%.16" PRIX64 " Count = %" PRId64 " Time = %s\n", ur, cnt,
clk_elapsed_us(&clk, buffer, sizeof(buffer)));
}
int main(void)
{
uint64_t u = UINT64_C(0xFEDCBA9876543210);
printf("0x%.16" PRIX64 " <=> 0x%.16" PRIX64 "\n", u, reverse_uint64_v1(u));
printf("0x%.16" PRIX64 " <=> 0x%.16" PRIX64 "\n", u, reverse_uint64_v2(u));
printf("0x%.16" PRIX64 " <=> 0x%.16" PRIX64 "\n", u, reverse_uint64_v3(u));
timing_test(reverse_uint64_v1);
timing_test(reverse_uint64_v2);
timing_test(reverse_uint64_v3);
timing_test(reverse_uint64_v1);
timing_test(reverse_uint64_v2);
timing_test(reverse_uint64_v3);
return 0;
}
Example output:
0xFEDCBA9876543210 <=> 0x1032547698BADCFE
0xFEDCBA9876543210 <=> 0x1032547698BADCFE
0xFEDCBA9876543210 <=> 0x1032547698BADCFE
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.543540
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.822616
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.303825
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.943668
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.314660
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.295862
The sum and count have two purposes. First, they provide a cross-check that the results from the three functions are the same. Second, they ensure that the compiler doesn't do anything like optimize the whole loop out of business.
As you can see, there is not a lot of difference between the v2 and v3 timings, but the v1 code is quite a bit slower than the v2 or v3 code. For clarity, then, I'd use the v2 code.
For comparison, I also added a 'do nothing' function:
uint64_t reverse_uint64_v4(uint64_t x)
{
return x;
}
Clearly, the sum from this is different, but the count is the same, so it measures the overhead of the loop control and counting. The times I got on two runs were:
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.965360
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.197267
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 7.454553
Sum = 0x09EBA33CFF9869C2 Count = 1683264863 Time = 3.607310
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.381292
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.804442
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.797625
Sum = 0x09EBA33CFF9869C2 Count = 1683264863 Time = 3.541233
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.438374
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.805865
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.797086
Sum = 0x09EBA33CFF9869C2 Count = 1683264863 Time = 3.532735
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 8.426701
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.824182
Sum = 0x0BC6E4692C2EC35A Count = 1683264863 Time = 6.834344
Sum = 0x09EBA33CFF9869C2 Count = 1683264863 Time = 3.510904
Clearly, about half the time in the test function is in the loop and function call overhead.
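For completeness, here is a different technique that was not among the variants timed above: a mask-and-shift swap network that reverses the bytes in three steps without a loop. It is a sketch only, the name reverse_uint64_swap is made up here, and no timing claim is made for it.

```c
#include <stdint.h>

/* Hypothetical variant: byte reversal by swapping progressively larger
   groups (bytes, then 16-bit pairs, then 32-bit halves). */
uint64_t reverse_uint64_swap(uint64_t x)
{
    /* swap adjacent bytes */
    x = ((x & UINT64_C(0x00FF00FF00FF00FF)) << 8) |
        ((x >> 8) & UINT64_C(0x00FF00FF00FF00FF));
    /* swap adjacent 16-bit groups */
    x = ((x & UINT64_C(0x0000FFFF0000FFFF)) << 16) |
        ((x >> 16) & UINT64_C(0x0000FFFF0000FFFF));
    /* swap the two 32-bit halves */
    return (x << 32) | (x >> 32);
}
```

For the test value used above, 0xFEDCBA9876543210, this yields 0x1032547698BADCFE, the same as v1 to v3.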

Related

Divide 64-bit integers as though the dividend is shifted left 64 bits, without having 128-bit types

Apologies for the confusing title. I'm not sure how to better describe what I'm trying to accomplish. I'm essentially trying to do the reverse of
getting the high half of a 64-bit multiplication in C for platforms where
int64_t divHi64(int64_t dividend, int64_t divisor) {
return ((__int128)dividend << 64) / (__int128)divisor;
}
isn't possible due to lacking support for __int128.
This can be done without a multi-word division
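Suppose we want to compute ⌊2^64 × x⁄y⌋. Then we can transform the expression like this:
$$\left\lfloor \frac{2^{64} x}{y} \right\rfloor = \left\lfloor \frac{2^{64}}{y} \right\rfloor x + \left\lfloor \frac{(2^{64} \bmod y)\, x}{y} \right\rfloor$$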
The first term is trivially done as ((-y)/y + 1)*x as per this question How to compute 2⁶⁴/n in C?
The second term is equivalent to (2^64 % y)/y*x and is a little bit trickier. I've tried various ways, but all need 128-bit multiplication and 128/64 division if using only integer operations. That can be done using the algorithms to calculate MulDiv64(a, b, c) = a*b/c in the questions below:
Most accurate way to do a combined multiply-and-divide operation in 64-bit?
How to multiply a 64 bit integer by a fraction in C++ while minimizing error?
(a * b) / c MulDiv and dealing with overflow from intermediate multiplication
How can I multiply and divide 64-bit ints accurately?
However they may be slow, and if you have those functions you can calculate the whole expression more easily, like MulDiv64(x, UINT64_MAX, y) + x/y + something, without bothering with the above transformation.
Using long double seems to be the easiest way if it has 64 bits of precision or more. So now it can be done by (2^64 % y)/(long double)y*x
uint64_t divHi64(uint64_t x, uint64_t y) {
uint64_t mod_y = UINT64_MAX % y + 1;
uint64_t result = ((-y)/y + 1)*x;
if (mod_y != y)
result += (uint64_t)((mod_y/(long double)y)*x);
return result;
}
The overflow check was omitted for simplification. A slight modification will be needed if you need signed division
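The split into two terms is easier to check at a smaller scale, where the exact answer is available. The sketch below applies the same idea to ⌊2^16 × x⁄y⌋ using only 32-bit integer arithmetic, so no floating point is needed; the function name div_hi16_split is invented for this example, and y is assumed to be nonzero.

```c
#include <stdint.h>

/* Hypothetical 16-bit analogue: floor(2^16 * x / y) computed as
   floor(2^16 / y) * x + floor((2^16 mod y) * x / y).
   Both products stay below 2^32, so uint32_t is wide enough.
   Assumes y != 0. */
uint32_t div_hi16_split(uint16_t x, uint16_t y)
{
    uint32_t q = UINT32_C(65536) / y;   /* floor(2^16 / y) */
    uint32_t r = UINT32_C(65536) % y;   /* 2^16 mod y, always less than y */
    return q * x + (r * x) / y;
}
```

For example, with x = 1000 and y = 3 the two terms are 21845 × 1000 and ⌊1 × 1000⁄3⌋ = 333, giving 21845333, which equals ⌊65536000⁄3⌋.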
If you're targeting 64-bit Windows with MSVC, which doesn't have __int128, you can use its 128-bit/64-bit divide intrinsic, which simplifies the job significantly without a 128-bit integer type. You still need to handle overflow, though, because the div instruction will raise an exception in that case
uint64_t divHi64(uint64_t x, uint64_t y) {
uint64_t high, remainder;
uint64_t low = _umul128(UINT64_MAX, y, &high);
if (x <= high /* && 0 <= low */)
return _udiv128(x, 0, y, &remainder);
// overflow case
errno = EOVERFLOW;
return 0;
}
The overflow check above can be simplified to checking whether x < y, because if x >= y then the result will overflow
See also
Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit)
Efficient computation of 2**64 / divisor via fast floating-point reciprocal
Exhaustive tests on 16/16-bit division show that my solution works correctly for all cases. However, you do need double even though float has more than 16 bits of precision; otherwise a result that is one less than the correct value will occasionally be returned. It may be fixed by adding an epsilon value before truncating: (uint64_t)((mod_y/(long double)y)*x + epsilon). That means you'll need __float128 (or the -m128bit-long-double option) in gcc for precise 64/64-bit output if you don't correct the result with epsilon. However, that type is available on 32-bit targets, unlike __int128, which is supported only on 64-bit targets, so life will be a bit easier. Of course you can use the function as-is if just a very close result is needed
Below is the code I've used for verifying
#include <cstdint>
#include <thread>
#include <iostream>
#include <limits>
#include <climits>
#include <mutex>
std::mutex print_mutex;
#define MAX_THREAD 8
#define NUM_BITS 27
#define CHUNK_SIZE (1ULL << NUM_BITS)
// typedef uint32_t T;
// typedef uint64_t T2;
// typedef double D;
typedef uint64_t T;
typedef unsigned __int128 T2; // the type twice as wide as T
typedef long double D;
// typedef __float128 D;
const D epsilon = 1e-14;
T divHi(T x, T y) {
T mod_y = std::numeric_limits<T>::max() % y + 1;
T result = ((-y)/y + 1)*x;
if (mod_y != y)
result += (T)((mod_y/(D)y)*x + epsilon);
return result;
}
void testdiv(T midpoint)
{
T begin = midpoint - CHUNK_SIZE/2;
T end = midpoint + CHUNK_SIZE/2;
for (T i = begin; i != end; i++)
{
T x = i & ((1 << NUM_BITS/2) - 1);
T y = CHUNK_SIZE/2 - (i >> NUM_BITS/2);
// if (y == 0)
// continue;
auto q1 = divHi(x, y);
T2 q2 = ((T2)x << sizeof(T)*CHAR_BIT)/y;
if (q2 != (T)q2)
{
// std::lock_guard<std::mutex> guard(print_mutex);
// std::cout << "Overflowed: " << x << '&' << y << '\n';
continue;
}
else if (q1 != q2)
{
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << x << '/' << y << ": " << q1 << " != " << (T)q2 << '\n';
}
}
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << "Done testing [" << begin << ", " << end << "]\n";
}
uint16_t divHi16(uint32_t x, uint32_t y) {
uint32_t mod_y = std::numeric_limits<uint16_t>::max() % y + 1;
int result = ((((1U << 16) - y)/y) + 1)*x;
if (mod_y != y)
result += (mod_y/(double)y)*x;
return result;
}
void testdiv16(uint32_t begin, uint32_t end)
{
for (uint32_t i = begin; i != end; i++)
{
uint32_t y = i & 0xFFFF;
if (y == 0)
continue;
uint32_t x = i & 0xFFFF0000;
uint32_t q2 = x/y;
if (q2 > 0xFFFF) // overflowed
continue;
uint16_t q1 = divHi16(x >> 16, y);
if (q1 != q2)
{
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << x << '/' << y << ": " << q1 << " != " << q2 << '\n';
}
}
}
int main()
{
std::thread t[MAX_THREAD];
for (int i = 0; i < MAX_THREAD; i++)
t[i] = std::thread(testdiv, std::numeric_limits<T>::max()/MAX_THREAD*i);
for (int i = 0; i < MAX_THREAD; i++)
t[i].join();
std::thread t2[MAX_THREAD];
constexpr uint32_t length = std::numeric_limits<uint32_t>::max()/MAX_THREAD;
uint32_t begin, end = length;
for (int i = 0; i < MAX_THREAD - 1; i++)
{
begin = end;
end += length;
t2[i] = std::thread(testdiv16, begin, end);
}
t2[MAX_THREAD - 1] = std::thread(testdiv16, end, UINT32_MAX);
for (int i = 0; i < MAX_THREAD; i++)
t2[i].join();
std::cout << "Done\n";
}

Interleave 4 byte ints to 8 byte int

I'm currently working to create a function which accepts two 4 byte unsigned integers, and returns an 8 byte unsigned long. I've tried to base my work off of the methods depicted by this research but all my attempts have been unsuccessful. The specific inputs I am working with are: 0x12345678 and 0xdeadbeef, and the result I'm looking for is 0x12de34ad56be78ef. This is my work so far:
unsigned long interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
int shift = 33;
for(int i = 64; i > 0; i-=16){
shift -= 8;
//printf("%d\n", i);
//printf("%d\n", shift);
result |= (x & i) << shift;
result |= (y & i) << (shift-1);
}
}
However, this function keeps returning 0xfffffffe which is incorrect. I am printing and verifying these values using:
printf("0x%x\n", z);
and the input is initialized like so:
uint32_t x = 0x12345678;
uint32_t y = 0xdeadbeef;
Any help on this topic would be greatly appreciated, C has been a very difficult language for me, and bitwise operations even more so.
This can be done based on interleaving bits, but skipping some steps so it only interleaves bytes. Same idea: first spread out the bytes in a couple of steps, then combine them.
Here is the plan, illustrated with my amazing freehand drawing skills:
In C (not tested):
// step 1, moving the top two bytes
uint64_t a = (((uint64_t)x & 0xFFFF0000) << 16) | (x & 0xFFFF);
// step 2, moving bytes 1 and 5 up into positions 2 and 6
a = ((a & 0x0000FF000000FF00) << 8) | (a & 0x000000FF000000FF);
// same thing with y
uint64_t b = (((uint64_t)y & 0xFFFF0000) << 16) | (y & 0xFFFF);
b = ((b & 0x0000FF000000FF00) << 8) | (b & 0x000000FF000000FF);
// merge them
uint64_t result = (a << 8) | b;
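The spread-then-merge idea can be wrapped into functions and checked against the example from the question. This is a sketch; the names spread_bytes and interleave_spread are invented here, and the step 2 mask is chosen so that it picks up bytes 1 and 5 and moves each of them up by one position.

```c
#include <stdint.h>

/* Hypothetical helper: spread the four bytes of v into the even byte
   positions (0, 2, 4, 6) of a 64-bit value. */
uint64_t spread_bytes(uint32_t v)
{
    /* step 1: move the top two bytes up by 16 bits */
    uint64_t a = (((uint64_t)v & 0xFFFF0000u) << 16) | (v & 0xFFFFu);
    /* step 2: move bytes 1 and 5 up by 8 bits, keep bytes 0 and 4 */
    a = ((a & UINT64_C(0x0000FF000000FF00)) << 8) |
        (a & UINT64_C(0x000000FF000000FF));
    return a;
}

/* Bytes of x land in the odd positions, bytes of y in the even ones. */
uint64_t interleave_spread(uint32_t x, uint32_t y)
{
    return (spread_bytes(x) << 8) | spread_bytes(y);
}
```

With the inputs from the question, interleave_spread(0x12345678, 0xdeadbeef) gives 0x12de34ad56be78ef. When printing the 64-bit result, use printf("0x%" PRIx64 "\n", result) from <inttypes.h>, since %x expects an unsigned int.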
Using SSSE3 PSHUFB has been suggested. It will work, but there is an instruction that can do a byte-wise interleave in one go: punpcklbw. So all we really need to do is get the values into and out of vector registers, and that single instruction will then take care of it.
Not tested:
uint64_t interleave(uint32_t x, uint32_t y) {
__m128i xvec = _mm_cvtsi32_si128(x);
__m128i yvec = _mm_cvtsi32_si128(y);
__m128i interleaved = _mm_unpacklo_epi8(yvec, xvec);
return _mm_cvtsi128_si64(interleaved);
}
With bit-shifting and bitwise operations (endianness independent):
uint64_t interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
for(uint8_t i = 0; i < 4; i ++){
result |= ((x & (0xFFull << (8*i))) << (8*(i+1)));
result |= ((y & (0xFFull << (8*i))) << (8*i));
}
return result;
}
With pointers (endianness dependent):
uint64_t interleave(uint32_t x, uint32_t y){
uint64_t result = 0;
uint8_t * x_ptr = (uint8_t *)&x;
uint8_t * y_ptr = (uint8_t *)&y;
uint8_t * r_ptr = (uint8_t *)&result;
for(uint8_t i = 0; i < 4; i++){
*(r_ptr++) = y_ptr[i];
*(r_ptr++) = x_ptr[i];
}
return result;
}
Note: this solution assumes little-endian byte order
You could do it like this:
uint64_t interleave(uint32_t x, uint32_t y)
{
uint64_t z;
unsigned char *a = (unsigned char *)&x; // 1
unsigned char *b = (unsigned char *)&y; // 1
unsigned char *c = (unsigned char *)&z;
c[0] = a[0];
c[1] = b[0];
c[2] = a[1];
c[3] = b[1];
c[4] = a[2];
c[5] = b[2];
c[6] = a[3];
c[7] = b[3];
return z;
}
Interchange a and b on the lines marked 1 depending on ordering requirement.
A version with shifts, where the LSB of y is always the LSB of the output as in your example, is:
uint64_t interleave(uint32_t x, uint32_t y)
{
return
(y & 0xFFull)
| (x & 0xFFull) << 8
| (y & 0xFF00ull) << 8
| (x & 0xFF00ull) << 16
| (y & 0xFF0000ull) << 16
| (x & 0xFF0000ull) << 24
| (y & 0xFF000000ull) << 24
| (x & 0xFF000000ull) << 32;
}
The compilers I tried don't seem to do a good job of optimizing either version so if this is a performance critical situation then maybe the inline assembly suggestion from comments is the way to go.
Use union punning. It is easy for the compiler to optimize.
#include <stdio.h>
#include <stdint.h>
#include <string.h>
typedef union
{
uint64_t u64;
struct
{
union
{
uint32_t a32;
uint8_t a8[4];
};
union
{
uint32_t b32;
uint8_t b8[4];
};
};
uint8_t u8[8];
}data_64;
uint64_t interleave(uint32_t a, uint32_t b)
{
data_64 in , out;
in.a32 = a;
in.b32 = b;
for(size_t index = 0; index < sizeof(a); index ++)
{
out.u8[index * 2 + 1] = in.a8[index];
out.u8[index * 2 ] = in.b8[index];
}
return out.u64;
}
int main(void)
{
printf("%llx\n", (unsigned long long)interleave(0x12345678U, 0xdeadbeefU));
}

How to multiply 2 uint8 modulo a big number without using integer type in C language [closed]

If A and B are of type uint8_t and I want the result C = A×B % N where N is 2^16, how do I do this in C if I can't use integers (so I can't declare N as an integer, only uint8_t)?
N.B: A, B and C are stored in uint8 arrays, so they are "expressed" as uint8 but their values can be bigger.
In general there is no easy way to do this.
Firstly you need to implement the multiply with carry between A and B for each uint8_t block. See the answer here.
Division by 2^16 really means "disregard" the lowest 16 bits, i.e. "don't use" the last two uint8_t (as you are using an array of uint8_t). Since you have the modulus operator, it is just the opposite: you only need to keep the last two uint8_t.
Take the lowest two uint8 of A (say a0 and a1) and B (say b0 and b1):
split each uint8 in high and low part
a0h = a0 >> 4; ## the same as a0h = a0/16;
a0l = a0 % 16; ## the same as a0l = a0 & 0x0f;
a1h = a1 >> 4;
a1l = a1 % 16;
b0h = b0 >> 4;
b0l = b0 % 16;
b1h = b1 >> 4;
b1l = b1 % 16;
Multiply the lower parts first (x is a buffer var)
x = a0l * b0l;
The first part of the result is the last four bits of x, let's call it s0l
s0l = x % 16;
The top four bits of x are the carry.
c = x>>4;
Multiply the higher parts of the first uint8 and add the carry.
x = (a0h * b0h) + c;
The first part of the result is the last four bits of x, let's call it s0h. And we need to get carry again.
s0h = x % 16;
c = x>>4;
We can now combine the s0:
s0 = (s0h << 4) + s0l;
Do exactly the same for the s1 (but don't forget to add the carry!):
x = (a1l * b1l) + c;
s1l = x % 16;
c = x>>4;
x = (a1h * b1h) + c;
s1h = x % 16;
c = x>>4;
s1 = (s1h << 4) + s1l;
Your result at this point is c, s1 and s0 (you need the carry for the next multiplications, e.g. s2, s3, s4). As your formula says % (2^16), you already have your result: s1 and s0. If you have to divide by something else, you should do something similar to the code above, but for division. In that case be careful to catch division by zero, because integer division by zero is undefined behaviour in C!
You can put A, B, C and S in array and loop it through the indexes to make code cleaner.
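As a sketch of how the nibble splitting can be carried through to a complete 8×8 product, the code below also adds the two cross terms (low nibble times high nibble), which a full schoolbook multiplication needs, and then combines three such products into A×B mod 2^16. The names mul8x8 and mulmod_2_16 are invented for this example, and storing the least significant byte at index 0 is an assumption of the sketch. All stored values are uint8_t, although C promotes them to int while evaluating each expression.

```c
#include <stdint.h>

/* Hypothetical helper: 8x8 -> 16-bit product kept in two uint8_t.
   *hi and *lo receive the high and low byte of a * b. */
void mul8x8(uint8_t a, uint8_t b, uint8_t *hi, uint8_t *lo)
{
    uint8_t ah = a >> 4, al = a & 0x0F;
    uint8_t bh = b >> 4, bl = b & 0x0F;
    uint8_t low = (uint8_t)(al * bl);    /* at most 15 * 15 = 225 */
    uint8_t high = (uint8_t)(ah * bh);
    uint8_t cross[2];
    cross[0] = (uint8_t)(al * bh);       /* cross terms carry weight 16 */
    cross[1] = (uint8_t)(ah * bl);

    for (uint8_t k = 0; k < 2; k++) {
        /* low nibble of the cross term goes into the upper half of low */
        uint8_t t = (uint8_t)(cross[k] << 4);
        low = (uint8_t)(low + t);
        if (low < t)                     /* wrapped around: carry into high */
            high++;
        /* high nibble of the cross term goes into the high byte */
        high = (uint8_t)(high + (cross[k] >> 4));
    }
    *hi = high;
    *lo = low;
}

/* c = (a * b) mod 2^16 for two-byte numbers, index 0 least significant.
   c must not overlap a or b. */
void mulmod_2_16(const uint8_t a[2], const uint8_t b[2], uint8_t c[2])
{
    uint8_t hi, lo;
    mul8x8(a[0], b[0], &c[1], &c[0]);    /* full product of the low bytes */
    mul8x8(a[1], b[0], &hi, &lo);        /* only the low byte matters mod 2^16 */
    c[1] = (uint8_t)(c[1] + lo);
    mul8x8(a[0], b[1], &hi, &lo);
    c[1] = (uint8_t)(c[1] + lo);
}
```

The product a[1]*b[1] and the high bytes of the two mixed products all have weight 2^16 or more, so they drop out of the result.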
Here's my effort. I took the liberty of using larger integers and pointers for looping through the arrays. The numbers are represented by arrays of uint8_t in big-endian order. All the intermediate results are kept in uint8_t variables. The code could be made more efficient if intermediate results could be stored in wider integer variables!
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
static void add_c(uint8_t *r, size_t len_r, uint8_t x)
{
uint8_t o;
while (len_r--) {
o = r[len_r];
r[len_r] += x;
if (o <= r[len_r])
break;
x = 1;
}
}
void multiply(uint8_t *res, size_t len_res,
const uint8_t *a, size_t len_a, const uint8_t *b, size_t len_b)
{
size_t ia, ib, ir;
for (ir = 0; ir < len_res; ir++)
res[ir] = 0;
for (ia = 0; ia < len_a && ia < len_res; ia++) {
uint8_t ah, al, t;
t = a[len_a - ia - 1];
ah = t >> 4;
al = t & 0xf;
for (ib = 0; ib < len_b && ia + ib < len_res; ib++) {
uint8_t bh, bl, x, o, c0, c1;
t = b[len_b - ib - 1];
bh = t >> 4;
bl = t & 0xf;
c0 = al * bl;
c1 = ah * bh;
o = c0;
t = al * bh;
x = (t & 0xf) << 4;
c0 += x;
x = (t >> 4);
c1 += x;
if (o > c0)
c1++;
o = c0;
t = ah * bl;
x = (t & 0xf) << 4;
c0 += x;
x = (t >> 4);
c1 += x;
if (o > c0)
c1++;
add_c(res, len_res - ia - ib, c0);
add_c(res, len_res - ia - ib - 1, c1);
}
}
}
int main(void)
{
uint8_t a[2] = { 0xee, 0xdd };
uint8_t b[2] = { 0xcc, 0xbb };
uint8_t r[4];
multiply(r, sizeof(r), a, sizeof(a), b, sizeof(b));
printf("0x%02X%02X * 0x%02X%02X = 0x%02X%02X%02X%02X\n",
a[0], a[1], b[0], b[1], r[0], r[1], r[2], r[3]);
return 0;
}
Output:
0xEEDD * 0xCCBB = 0xBF06976F

Multi-precision addition implementation

I am trying to implement multi-precision arithmetic for 256-bit operands based on radix-2^32 representation. In order to do that I defined operands as:
typedef union UN_256fe{
uint32_t uint32[8];
}UN_256fe;
and here is my MP addition function:
void add256(UN_256fe* A, UN_256fe* B, UN_256fe* result){
uint64_t t0, t1;
t0 = (uint64_t) A->uint32[7] + B->uint32[7];
result->uint32[7] = (uint32_t)t0;
t1 = (uint64_t) A->uint32[6] + B->uint32[6] + (t0 >> 32);
result->uint32[6] = (uint32_t)t1;
t0 = (uint64_t) A->uint32[5] + B->uint32[5] + (t1 >> 32);
result->uint32[5] = (uint32_t)t0;
t1 = (uint64_t) A->uint32[4] + B->uint32[4] + (t0 >> 32);
result->uint32[4] = (uint32_t)t1;
t0 = (uint64_t) A->uint32[3] + B->uint32[3] + (t1 >> 32);
result->uint32[3] = (uint32_t)t0;
t1 = (uint64_t) A->uint32[2] + B->uint32[2] + (t0 >> 32);
result->uint32[2] = (uint32_t)t1;
t0 = (uint64_t) A->uint32[1] + B->uint32[1] + (t1 >> 32);
result->uint32[1] = (uint32_t)t0;
t1 = (uint64_t) A->uint32[0] + B->uint32[0] + (t0 >> 32);
result->uint32[0] = (uint32_t)t1;
}
I implemented it without using loop for simplicity. Now when I test my function inside main:
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>
#include "mmulv3.0.h"
int main(){
UN_256fe result;
uint32_t c;
UN_256fe a = {0x00000f00,0xff00ff00,0xffff0000,0xf0f0f0f0,0x00000000,0xffffffff,0xf0fff000,0xfff0fff0};
UN_256fe b = {0x0000f000,0xff00ff00,0xffff0000,0xf0f0f0f0,0x00000000,0xffffffff,0xf0fff000,0xfff0ffff};
c = 2147483577;
printf("a:\n");
for(int i = 0; i < 8; i +=1){
printf("%"PRIu32, a.uint32[i]);
}
printf("\nb:\n");
for(int i = 0; i < 8; i +=1){
printf("%"PRIu32, b.uint32[i]);
}
add256(&a, &b, &result);
printf("\nResult for add256(a,b) = a + b:\n");
for(int i = 0; i < 8; i +=1){
printf("%"PRIu32, result.uint32[i]);
}
return 0;
}
I've got:
a:
38404278255360429490176040423221600429496729540433049604293984240
b:
614404278255360429490176040423221600429496729540433049604293984255
Result for add256(a,b) = a + b:
652814261543425429483622537896770241429496729537916426254293001199
However, when I verified my result with sage, I've got:
sage: a=38404278255360429490176040423221600429496729540433049604293984240
sage: b=614404278255360429490176040423221600429496729540433049604293984255
sage: a+b
652808556510720858980352080846443200858993459080866099208587968495
Would you please help me out here?
The algorithm for addition seems correct, but you cannot print these 256-bit integers in decimal by converting each component individually.
Just think of this simple example: 0x100000000, stored as { 0,0,0,0,0,0,1,0 }, will print as 10 instead of 4294967296. Base-10 conversion is significantly more complex than the simple addition.
Would have helped to write a space character between the numbers.
But how much is 384 + 6144 + 1? I think you are adding from the wrong end.
To print a multi-precision number in decimal takes a bit of code as many of the digits depend on the entire uint32[8].
To print it out in hexadecimal is much easier.
fputs("0x", stdout);
for (int i = 0; i < 8; i +=1){
printf("%08" PRIX32, a.uint32[i]);
}
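For decimal output, one common approach is to divide the whole number repeatedly by 10^9 and print the remainders in reverse order. The following is a minimal sketch under the assumption that uint32[0] is the most significant word, as in the hex loop above; the function name u256_to_decimal is invented for this example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the decimal form of a 256-bit number into out.
   w[0] is the most significant 32-bit word.
   out must have room for at least 79 bytes (78 digits plus the NUL). */
void u256_to_decimal(const uint32_t w[8], char *out)
{
    uint32_t tmp[8];
    uint32_t chunks[9];   /* 9 decimal digits per chunk; 2^256 has 78 digits */
    int n = 0;
    int nonzero;

    memcpy(tmp, w, sizeof tmp);
    do {
        uint64_t rem = 0;
        nonzero = 0;
        /* long division of the whole number by 10^9, top word first */
        for (int i = 0; i < 8; i++) {
            uint64_t cur = (rem << 32) | tmp[i];
            tmp[i] = (uint32_t)(cur / 1000000000u);
            rem = cur % 1000000000u;
            nonzero |= (tmp[i] != 0);
        }
        chunks[n++] = (uint32_t)rem;   /* next 9 low-order decimal digits */
    } while (nonzero);

    /* most significant chunk unpadded, the rest zero-padded to 9 digits */
    out += sprintf(out, "%" PRIu32, chunks[--n]);
    while (n > 0)
        out += sprintf(out, "%09" PRIu32, chunks[--n]);
}
```

Calling it with the words { 0,0,0,0,0,0,1,0 } produces the string 4294967296, the value of 0x100000000.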

how to calculate (a times b) divided by c only using 32-bit integer types even if a times b would not fit such a type

Consider the following as a reference implementation:
/* calculates (a * b) / c */
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c)
{
uint64_t x = a;
x = x * b;
x = x / c;
return x;
}
I am interested in an implementation (in C or pseudocode) that does not require a 64-bit integer type.
I started sketching an implementation that outlines like this:
/* calculates (a * b) / c */
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c)
{
uint32_t d1, d2, d1d2;
d1 = (1 << 10);
d2 = (1 << 10);
d1d2 = (1 << 20); /* d1 * d2 */
return ((a / d1) * (b /d2)) / (c / d1d2);
}
But the difficulty is to pick values for d1 and d2 that manage to avoid the overflow ((a / d1) * (b / d2) <= UINT32_MAX) and minimize the error of the whole calculation.
Any thoughts?
I have adapted the algorithm posted by Paul for unsigned ints (by omitting the parts that are dealing with signs). The algorithm is basically Ancient Egyptian multiplication of a with the fraction floor(b/c) + (b%c)/c (with the slash denoting real division here).
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c)
{
uint32_t q = 0; // the quotient
uint32_t r = 0; // the remainder
uint32_t qn = b / c;
uint32_t rn = b % c;
while(a)
{
if (a & 1)
{
q += qn;
r += rn;
if (r >= c)
{
q++;
r -= c;
}
}
a >>= 1;
qn <<= 1;
rn <<= 1;
if (rn >= c)
{
qn++;
rn -= c;
}
}
return q;
}
This algorithm will yield the exact answer as long as it fits in 32 bits. You can optionally also return the remainder r.
The simplest way would be converting the intermediate result to 64 bits, but, depending on the value of c, you could use another approach:
(a/c)*b + (a%c)*(b/c) + ((a%c)*(b%c))/c
The only problem is that the last term could still overflow for large values of c. Still thinking about it.
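As a sketch of that three-term decomposition, the function below restricts c to at most 65536, so that both remainders are below 2^16 and their product fits in 32 bits. The name muldiv_small_c and the restriction on c are assumptions of this example, and the final quotient is assumed to fit in 32 bits.

```c
#include <stdint.h>

/* Hypothetical sketch: (a * b) / c using only 32-bit arithmetic.
   Assumes 0 < c <= 65536, so (a % c) * (b % c) < 2^32, and assumes
   the true quotient fits in 32 bits. */
uint32_t muldiv_small_c(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t qa = a / c, ra = a % c;
    uint32_t qb = b / c, rb = b % c;
    /* a*b/c = qa*b + ra*qb + (ra*rb)/c, where only the last term truncates */
    return qa * b + ra * qb + (ra * rb) / c;
}
```

For a = 4000000000, b = 3, c = 60000 the terms are 66666 × 3 = 199998, 40000 × 0 = 0 and ⌊40000 × 3⁄60000⌋ = 2, giving 200000, although a × b does not fit in 32 bits.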
You can first divide a by c and also get the remainder of the division, and multiply the remainder by b before dividing it by c. That way you only lose data in the last division, and you get the same result as with the 64-bit division.
You can rewrite the formula like this (where \ is integer division):
a * b / c =
(a / c) * b =
(a \ c + (a % c) / c) * b =
(a \ c) * b + ((a % c) * b) / c
By making sure that a >= b, you can use larger values before they overflow:
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c) {
uint32_t hi = a > b ? a : b;
uint32_t lo = a > b ? b : a;
return (hi / c) * lo + (hi % c) * lo / c;
}
Another approach would be to loop addition and subtraction instead of multiplying and dividing, but that is of course a lot more work:
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c) {
uint32_t hi = a > b ? a : b;
uint32_t lo = a > b ? b : a;
uint32_t sum = 0;
uint32_t cnt = 0;
for (uint32_t i = 0; i < hi; i++) {
sum += lo;
while (sum >= c) {
sum -= c;
cnt++;
}
}
return cnt;
}
Searching on www.google.com/codesearch turns up a number of implementations, including this wonderfully obvious one. I particularly like the extensive comments and well-chosen variable names
INT32 muldiv(INT32 a, INT32 b, INT32 c)
{ INT32 q=0, r=0, qn, rn;
int qneg=0, rneg=0;
if (c==0) c=1;
if (a<0) { qneg=!qneg; rneg=!rneg; a = -a; }
if (b<0) { qneg=!qneg; rneg=!rneg; b = -b; }
if (c<0) { qneg=!qneg; c = -c; }
qn = b / c;
rn = b % c;
while(a)
{ if (a&1) { q += qn;
r += rn;
if(r>=c) { q++; r -= c; }
}
a >>= 1;
qn <<= 1;
rn <<= 1;
if (rn>=c) {qn++; rn -= c; }
}
result2 = rneg ? -r : r;
return qneg ? -q : q;
}
http://www.google.com/codesearch/p?hl=en#HTrPUplLEaU/users/mr/MCPL/mcpl.tgz|gIE-sNMlwIs/MCPL/mintcode/sysc/mintsys.c&q=muldiv%20lang:c
I implemented Sven's code for UINT16 in order to test it intensively:
uint16_t muldiv16(uint16_t a, uint16_t b, uint16_t c);
int main(int argc, char *argv[]){
uint32_t a;
uint32_t b;
uint32_t c;
uint16_t r1, r2;
// ~167 days, estimated on i7 6700k, single thread.
// Split the 'a' range, to run several instances of this code on multi-cores processor
// ~1s, with an UINT8 implementation
for(a=0; a<=UINT16_MAX; a++){
for(b=0; b<=UINT16_MAX; b++){
for(c=1; c<=UINT16_MAX; c++){
r1 = uint16_t( a*b/c );
r2 = muldiv16(uint16_t(a), uint16_t(b), uint16_t(c));
if( r1 != r2 ){
std::cout << "Err: " << a << " * " << b << " / " << c << ", result: " << r2 << ", expected: " << r1 << std::endl;
return -1;
}
}
}
std::cout << a << std::endl;
}
std::cout << "Done." << std::endl;
return 0;
}
Unfortunately, it seems that it is limited to UINT31 for 'b' (0-2147483647).
Here is my correction, which seems to work (the UINT16 test has not run to completion, but it has run for a long time; the UINT8 test completed).
uint32_t muldiv32(uint32_t a, uint32_t b, uint32_t c)
{
uint32_t q = 0; // the quotient
uint32_t r = 0; // the remainder
uint32_t qn = b / c;
uint32_t rn = b % c;
uint32_t r_carry;
uint32_t rn_carry;
while(a)
{
if (a & 1)
{
q += qn;
r_carry = (r > UINT32_MAX-rn);
r += rn;
if (r >= c || r_carry)
{
q++;
r -= c;
}
}
a >>= 1;
qn <<= 1;
rn_carry = rn & 0x80000000UL;
rn <<= 1;
if (rn >= c || rn_carry)
{
qn++;
rn -= c;
}
}
return q;
}
Edit: an improved version that returns the remainder, handles rounding, reports overflow and, of course, handles the full UINT32 range for a, b and c:
typedef enum{
ROUND_DOWNWARD=0,
ROUND_TONEAREST,
ROUND_UPWARD
}ROUND;
//remainder is always positive for ROUND_DOWNWARD ( a * b = c * q + remainder )
//remainder is always negative for ROUND_UPWARD ( a * b = c * q - remainder )
//remainder is signed for ROUND_TONEAREST ( a * b = c * q + int32_t(remainder) )
uint32_t muldiv32(uint32_t a, uint32_t b, uint32_t c, uint32_t *remainder, ROUND round, uint8_t *ovf)
{
uint32_t q = 0; // the quotient
uint32_t r = 0; // the remainder
uint32_t qn = b / c;
uint32_t rn = b % c;
uint32_t r_carry;
uint32_t rn_carry;
uint8_t o = 0;
uint8_t rup;
while(a)
{
if (a & 1)
{
o |= (q > UINT32_MAX-qn);
q += qn;
r_carry = (r > UINT32_MAX-rn);
r += rn;
if (r >= c || r_carry)
{
o |= (q == UINT32_MAX);
q++;
r -= c;
}
}
a >>= 1;
qn <<= 1;
rn_carry = rn & 0x80000000;
rn <<= 1;
if (rn >= c || rn_carry)
{
qn++;
rn -= c;
}
}
rup = (round == ROUND_UPWARD && r);
rup |= (round == ROUND_TONEAREST && ((r<<1) >= c || r & 0x80000000));
if(rup)
{ //round
o |= (q == UINT32_MAX);
q++;
r = (round == ROUND_UPWARD) ? c-r : r-c;
}
if(remainder)
*remainder = r;
if(ovf)
*ovf = o;
return q;
}
There may be another approach, perhaps an even more efficient one:
8-bit, 16-bit and 32-bit MCUs are able to perform 64-bit calculations (long long int).
Does anyone know how the compilers emulate them?
Edit 2:
Here are some interesting timings on an 8-bit MCU:
UINT8 x UINT8 / UINT8: 3.5µs
UINT16 x UINT16 / UINT16: 22.5µs, muldiv8: 29.9 to 45.3µs
UINT32 x UINT32 / UINT32: 84µs, muldiv16: 120 to 189µs
FLOAT32 * FLOAT32 / FLOAT32: 40.2 to 135.5µs, muldiv32: 1.193 to 1.764ms
And on a 32-bit MCU:
Type - optimized code - without optimization
UINT32: 521ns - 604ns
UINT64: 2958ns - 3313ns
FLOAT32: 2563ns - 2688ns
muldiv32: 6791ns - 25375ns
So the compilers are cleverer than this C algorithm.
And it is always better to work with float variables (even without an FPU) than with integers bigger than the native registers (even though float32 has worse precision than uint32, starting at 16777217).
Edit 3: OK, so my N-bit MCUs use a native N-bit by N-bit MUL instruction that produces a 2N-bit result, stored in two N-bit registers.
Here, you can find a C implementation (prefer EasyasPi's solution).
But they don't have a native 2N-bit by N-bit DIV instruction. Instead, they use the __udivdi3 function from gcc, with loops and 2N-bit variables (here, UINT64). So this cannot be a solution to the original question.
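To make the "N-bit by N-bit MUL giving a 2N-bit result in two registers" idea concrete, here is a minimal sketch of a 32 x 32 -> 64-bit multiply that uses only 32-bit arithmetic. It is not the linked implementation; the name `mul32_wide` and its interface are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: full 32x32 -> 64-bit product, returned as two
   32-bit halves, using schoolbook multiplication on 16-bit digits. */
void mul32_wide(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo) {
    assert(hi && lo);

    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;   /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;   /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;   /* contributes to bits 32..63 */

    /* Middle column: three values below 2^16 each, so no 32-bit overflow. */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);

    *lo = (mid << 16) | (p0 & 0xFFFFu);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}
```

Calling `mul32_wide(0xFFFFFFFFu, 0xFFFFFFFFu, &hi, &lo)` leaves `hi == 0xFFFFFFFE` and `lo == 1`. This covers only the multiply half of the problem; dividing the two-word product by `c` still needs a long-division loop like the ones discussed above.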
If b and c are both constants, you can calculate the result very simply using Egyptian fractions.
For example, y = a * 4 / 99 can be written as
y = a / 25 + a / 2475
You can express any fraction as a sum of Egyptian fractions, as explained in answers to Egyptian Fractions in C.
Having b and c fixed in advance might seem like a bit of a restriction, but this method is a lot simpler than the general case answered by others.
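A minimal sketch of the 4/99 example, with an assumed function name `mul4_div99` and a 64-bit reference added only for comparison. Because each division truncates separately, the Egyptian-fraction form can come out one lower than the exactly truncated result.

```c
#include <stdint.h>

/* a * 4 / 99 rewritten as a/25 + a/2475, since 1/25 + 1/2475 = 4/99.
   There is no intermediate product, so nothing can overflow. */
uint32_t mul4_div99(uint32_t a) {
    return a / 25u + a / 2475u;
}

/* Reference for comparison only: assumes a 64-bit type is available. */
uint32_t mul4_div99_ref(uint32_t a) {
    return (uint32_t)(((uint64_t)a * 4u) / 99u);
}
```

With `a = 2000000000`, where `a * 4` no longer fits in 32 bits, both functions return 80808080. With `a = 99` the reference returns 4 but the Egyptian-fraction form returns 3, so whether this method is acceptable depends on how much truncation error you can tolerate.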
I suppose there are reasons you can't do
x = a/c;
x = x*b;
are there? And maybe add
y = b/c;
y = y*a;
if ( x != y )
return ERROR_VALUE;
Note that, since you're using integer division, a*b/c and a/c*b can give different values whenever a is not an exact multiple of c (for example, when c is bigger than a). Also, if both a and b are smaller than c, it won't work at all: both a/c and b/c are zero.
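A small sketch of the difference, using made-up helper names `mul_then_div` and `div_then_mul`:

```c
#include <assert.h>
#include <stdint.h>

/* Multiply first: correctly truncated result, but a * b may overflow. */
uint32_t mul_then_div(uint32_t a, uint32_t b, uint32_t c) {
    assert(c != 0);
    return a * b / c;
}

/* Divide first: cannot overflow, but a % c is discarded before multiplying. */
uint32_t div_then_mul(uint32_t a, uint32_t b, uint32_t c) {
    assert(c != 0);
    return a / c * b;
}
```

For a = 7, b = 3, c = 2, multiplying first gives 10, while `div_then_mul(7, 3, 2)` gives 9 and `div_then_mul(3, 7, 2)` gives 7, so the x != y check above would report an error. For a = 3, b = 5, c = 7, multiplying first gives 2 but dividing first gives 0 either way round.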

Resources