Integer cube root - c

I'm looking for fast code for 64-bit (unsigned) cube roots. (I'm using C and compiling with gcc, but I imagine most of the work required will be language- and compiler-agnostic.) I will denote by ulong a 64-bit unsigned integer.
Given an input n I require the (integral) return value r to be such that
r * r * r <= n && n < (r + 1) * (r + 1) * (r + 1)
That is, I want the cube root of n, rounded down. Basic code like
return (ulong)pow(n, 1.0/3);
is incorrect because of rounding toward the end of the range. Unsophisticated code like
ulong
cuberoot(ulong n)
{
ulong ret = pow(n + 0.5, 1.0/3);
if (n < 100000000000001ULL)
return ret;
if (n >= 18446724184312856125ULL)
return 2642245ULL;
if (ret * ret * ret > n) {
ret--;
while (ret * ret * ret > n)
ret--;
return ret;
}
while ((ret + 1) * (ret + 1) * (ret + 1) <= n)
ret++;
return ret;
}
gives the correct result, but is slower than it needs to be.
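To test candidate implementations against the condition above, the check itself has to avoid overflow: (r + 1)^3 no longer fits in 64 bits once r reaches 2642245. A minimal sketch of such a checker follows; the name is_floor_cbrt and the constant CBRT_U64_MAX are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest r whose cube still fits in 64 bits (2642245^3 < 2^64). */
#define CBRT_U64_MAX 2642245ULL

/* Returns true iff r*r*r <= n < (r+1)*(r+1)*(r+1), without overflowing. */
bool is_floor_cbrt(uint64_t n, uint64_t r)
{
    if (r > CBRT_U64_MAX)
        return false;                 /* r*r*r would not fit in 64 bits */
    if (r * r * r > n)
        return false;
    if (r == CBRT_U64_MAX)
        return true;                  /* (r+1)^3 exceeds every 64-bit n */
    return n < (r + 1) * (r + 1) * (r + 1);
}
```

Running a candidate over every perfect cube c and its neighbour c - 1 covers the inputs where rounding errors tend to show up.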
This code is for a math library and it will be called many times from various functions. Speed is important, but you can't count on a warm cache (so suggestions like a 2,642,245-entry binary search are right out).
For comparison, here is code that correctly calculates the integer square root.
ulong squareroot(ulong a) {
ulong x = (ulong)sqrt((double)a);
if (x > 0xFFFFFFFF || x*x > a)
x--;
return x;
}

The book "Hacker's Delight" has algorithms for this and many other problems. The code is online here. EDIT: That code doesn't work properly with 64-bit ints, and the instructions in the book on how to fix it for 64-bit are somewhat confusing. A proper 64-bit implementation (including test case) is online here.
I doubt that your squareroot function works "correctly" - it should be ulong a for the argument, not n :) (but the same approach would work using cbrt instead of sqrt, although not all C math libraries have cube root functions).

I've adapted the algorithm presented in 1.5.2 (the kth root) in Modern Computer Arithmetic (Brent and Zimmermann). For the case of (k == 3), and given a 'relatively' accurate over-estimate of the initial guess - this algorithm seems to out-perform the 'Hacker's Delight' code above.
Not only that, but MCA as a text provides theoretical background as well as a proof of correctness and terminating criteria.
Provided that we can produce a 'relatively' good initial over-estimate, I haven't been able to find a case that exceeds (7) iterations. (Is this effectively related to 64-bit values having 2^6 bits?) Either way, it's an improvement over the (21) iterations in the HacDel code, which converges linearly, O(b), although its loop body is evidently much faster.
The initial estimate I've used is based on a 'rounding up' of the number of significant bits in the value (x). Given (b) significant bits in (x), we can say: 2^(b - 1) <= x < 2^b. I state without proof (though it should be relatively easy to demonstrate) that: 2^ceil(b / 3) > x^(1/3)
static inline uint32_t u64_cbrt (uint64_t x)
{
uint64_t r0 = 1, r1;
/* IEEE-754 cbrt *may* not be exact. */
if (x == 0) /* cbrt(0) : */
return (0);
int b = (64) - __builtin_clzll(x);
r0 <<= (b + 2) / 3; /* ceil(b / 3) */
do /* quadratic convergence: */
{
r1 = r0;
r0 = (2 * r1 + x / (r1 * r1)) / 3;
}
while (r0 < r1);
return ((uint32_t) r1); /* floor(cbrt(x)); */
}
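For compilers without __builtin_clzll, the same iteration can be driven by a plain bit-counting loop. The sketch below is a variant under that assumption, not the code I benchmarked; the names bit_length_u64 and u64_cbrt_portable are invented here. The starting value is still a strict over-estimate, because x < 2^b implies x^(1/3) < 2^(b/3) <= 2^ceil(b/3).

```c
#include <stdint.h>

/* Number of significant bits in x (0 for x == 0); a portable stand-in
   for 64 - __builtin_clzll(x). */
int bit_length_u64(uint64_t x)
{
    int b = 0;
    while (x != 0) {
        x >>= 1;
        b++;
    }
    return b;
}

/* floor(cbrt(x)) by integer Newton iteration from an over-estimate. */
uint32_t u64_cbrt_portable(uint64_t x)
{
    if (x == 0)
        return 0;
    /* 2^ceil(b / 3) > cbrt(x); at most 2^22, so r1 * r1 cannot overflow. */
    uint64_t r0 = (uint64_t)1 << ((bit_length_u64(x) + 2) / 3);
    uint64_t r1;
    do {
        r1 = r0;
        r0 = (2 * r1 + x / (r1 * r1)) / 3;
    } while (r0 < r1);          /* stop as soon as the estimate stops shrinking */
    return (uint32_t)r1;
}
```

The bit-counting loop costs up to 64 iterations, so this trades speed for portability.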
A cbrt call probably isn't all that useful - unlike the sqrt call which can be efficiently implemented on modern hardware. That said, I've seen gains for sets of values under 2^53 (exactly represented in IEEE-754 doubles), which surprised me.
The only downside is the division by: (r * r) - this can be slow, as the latency of integer division continues to fall behind other advances in ALUs. The division by a constant: (3) is handled by reciprocal methods on any modern optimising compiler.
It's interesting that Intel's 'Icelake' microarchitecture will significantly improve integer division - an operation that seems to have been neglected for a long time. I simply won't trust the 'Hacker's Delight' answer until I can find a sound theoretical basis for it. And then I have to work out which variant is the 'correct' answer.

You could try a Newton's step to fix your rounding errors:
ulong r = (ulong)pow(n, 1.0/3);
if(r==0) return r; /* avoid divide by 0 later on */
ulong r3 = r*r*r;
ulong slope = 3*r*r;
ulong r1 = r+1;
ulong r13 = r1*r1*r1;
/* making sure to handle unsigned arithmetic correctly */
if(n >= r13) r+= (n - r3)/slope;
if(n < r3) r-= (r3 - n)/slope;
A single Newton step ought to be enough, but you may have off-by-one (or possibly more?) errors. You can check/fix those using a final check&increment step, as in your OQ:
while(r*r*r > n) --r;
while((r+1)*(r+1)*(r+1) <= n) ++r;
or some such.
(I admit I'm lazy; the right way to do it is to carefully check to determine which (if any) of the check&increment things is actually necessary...)

If pow is too expensive, you can use a count-leading-zeros instruction to get an approximation to the result, then use a lookup table, then some Newton steps to finish it.
int k = __builtin_clzll(n); // counts # of leading zeros of a 64-bit value (often a single assembly insn); undefined for n == 0
int b = 64 - k; // # of bits in n
int top8 = n >> (b - 8); // top 8 bits of n (top bit is always 1)
int approx = table[b][top8 & 0x7f];
Given b and top8, you can use a lookup table (in my code, 8K entries) to find a good approximation to cuberoot(n). Use some Newton steps (see comingstorm's answer) to finish it.

// On my pc: Math.Sqrt 35 ns, cbrt64 <70ns, cbrt32 <25 ns, (cbrt12 < 10ns)
// cbrt64(ulong x) is a C# version of:
// http://www.hackersdelight.org/hdcodetxt/acbrt.c.txt (acbrt1)
// cbrt32(uint x) is a C# version of:
// http://www.hackersdelight.org/hdcodetxt/icbrt.c.txt (icbrt1)
// Union in C#:
// http://www.hanselman.com/blog/UnionsOrAnEquivalentInCSairamasTipOfTheDay.aspx
using System.Runtime.InteropServices;
[StructLayout(LayoutKind.Explicit)]
public struct fu_32 // float <==> uint
{
[FieldOffset(0)]
public float f;
[FieldOffset(0)]
public uint u;
}
private static uint cbrt64(ulong x)
{
if (x >= 18446724184312856125) return 2642245;
float fx = (float)x;
fu_32 fu32 = new fu_32();
fu32.f = fx;
uint uy = fu32.u / 4;
uy += uy / 4;
uy += uy / 16;
uy += uy / 256;
uy += 0x2a5137a0;
fu32.u = uy;
float fy = fu32.f;
fy = 0.33333333f * (fx / (fy * fy) + 2.0f * fy);
int y0 = (int)
(0.33333333f * (fx / (fy * fy) + 2.0f * fy));
uint y1 = (uint)y0;
ulong y2, y3;
if (y1 >= 2642245)
{
y1 = 2642245;
y2 = 6981458640025;
y3 = 18446724184312856125;
}
else
{
y2 = (ulong)y1 * y1;
y3 = y2 * y1;
}
if (y3 > x)
{
y1 -= 1;
y2 -= 2 * y1 + 1;
y3 -= 3 * y2 + 3 * y1 + 1;
while (y3 > x)
{
y1 -= 1;
y2 -= 2 * y1 + 1;
y3 -= 3 * y2 + 3 * y1 + 1;
}
return y1;
}
do
{
y3 += 3 * y2 + 3 * y1 + 1;
y2 += 2 * y1 + 1;
y1 += 1;
}
while (y3 <= x);
return y1 - 1;
}
private static uint cbrt32(uint x)
{
uint y = 0, z = 0, b = 0;
int s = x < 1u << 24 ? x < 1u << 12 ? x < 1u << 06 ? x < 1u << 03 ? 00 : 03 :
x < 1u << 09 ? 06 : 09 :
x < 1u << 18 ? x < 1u << 15 ? 12 : 15 :
x < 1u << 21 ? 18 : 21 :
x >= 1u << 30 ? 30 : x < 1u << 27 ? 24 : 27;
do
{
y *= 2;
z *= 4;
b = 3 * y + 3 * z + 1 << s;
if (x >= b)
{
x -= b;
z += 2 * y + 1;
y += 1;
}
s -= 3;
}
while (s >= 0);
return y;
}
private static uint cbrt12(uint x) // x < ~255
{
uint y = 0, a = 0, b = 1, c = 0;
while (a < x)
{
y++;
b += c;
a += b;
c += 6;
}
if (a != x) y--;
return y;
}

Starting from the code within the GitHub gist from the answer of Fabian Giesen, I have arrived at the following, faster implementation:
#include <stdint.h>
static inline uint64_t icbrt(uint64_t x) {
uint64_t b, y, bits = 3*21;
int s;
for (s = bits - 3; s >= 0; s -= 3) {
if ((x >> s) == 0)
continue;
x -= (uint64_t)1 << s; /* widen first: 1 << s overflows int for s >= 31 */
y = 1;
for (s = s - 3; s >= 0; s -= 3) {
y += y;
b = 1 + 3*y*(y + 1);
if ((x >> s) >= b) {
x -= b << s;
y += 1;
}
}
return y;
}
return 0;
}
While the above is still somewhat slower than methods relying on the GNU specific __builtin_clzll, the above does not make use of compiler specifics and is thus completely portable.
The bits constant
Lowering the constant bits leads to faster computation, but the highest number x for which the function gives correct results is (1 << bits) - 1. Also, bits must be a multiple of 3 and be at most 64, meaning that its maximum value is really 3*21 == 63. With bits = 3*21, icbrt() thus works for input x <= 9223372036854775807. If we know that a program is working with limited x, say x < 1000000, then we can speed up the cube root computation by setting bits = 3*7, since (1 << 3*7) - 1 = 2097151 >= 1000000.
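To make the trade-off concrete, here is a sketch of the same loop with the bit budget passed in as a parameter instead of being hard-coded. The name icbrt_bits is invented for this illustration, and the function assumes that bits is a multiple of 3 between 3 and 63 and that x < 2^bits; outside that range the result is wrong, as described above.

```c
#include <stdint.h>

/* Digit-by-digit cube root with a caller-chosen bit budget.
   Assumes: bits % 3 == 0, 3 <= bits <= 63, and x < 2^bits. */
uint64_t icbrt_bits(uint64_t x, int bits)
{
    uint64_t b, y;
    int s;
    for (s = bits - 3; s >= 0; s -= 3) {
        if ((x >> s) == 0)
            continue;               /* skip leading all-zero 3-bit groups */
        x -= (uint64_t)1 << s;      /* first non-zero group: root bit is 1 */
        y = 1;
        for (s = s - 3; s >= 0; s -= 3) {
            y += y;                 /* shift the partial root left by one */
            b = 1 + 3 * y * (y + 1);  /* (y + 1)^3 - y^3 */
            if ((x >> s) >= b) {
                x -= b << s;
                y += 1;
            }
        }
        return y;
    }
    return 0;
}
```

For example, icbrt_bits(999999, 21) scans at most 7 three-bit groups where the full 63-bit version scans 21.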
64-bit vs. 32-bit integers
Though the above is written for 64-bit integers, the logic is the same for 32-bit:
#include <stdint.h>
static inline uint32_t icbrt(uint32_t x) {
uint32_t b, y, bits = 3*7; /* or whatever is appropriate */
int s;
for (s = bits - 3; s >= 0; s -= 3) {
if ((x >> s) == 0)
continue;
x -= 1 << s;
y = 1;
for (s = s - 3; s >= 0; s -= 3) {
y += y;
b = 1 + 3*y*(y + 1);
if ((x >> s) >= b) {
x -= b << s;
y += 1;
}
}
return y;
}
return 0;
}

I would research how to do it by hand, and then translate that into a computer algorithm, working in base 2 rather than base 10.
We end up with an algorithm something like (pseudocode):
Find the largest n such that (1 << 3n) <= input (an input of 0 simply returns 0).
result = 1 << n.
For i in (n-1)..0:
if ((result | 1 << i)**3) <= input:
result |= 1 << i.
We can optimize the calculation of (result | 1 << i)**3 by observing that the bitwise-or is equivalent to addition here (bit i of result is still clear), writing b = 1 << i and expanding to result**3 + 3 * b * result**2 + 3 * b**2 * result + b**3, caching the values of result**3 and result**2 between iterations, and using shifts instead of multiplication (every multiplication by a power of b is a left shift).
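A minimal C sketch of the bit-by-bit search, without the caching optimization: it recomputes the cube at every step and always starts from bit 21, which is the highest bit a 64-bit cube root can have. The function name icbrt_bitwise is made up, and the guard against 2642245 (the largest value whose cube fits in 64 bits) is my addition to avoid overflow.

```c
#include <stdint.h>

/* floor(cbrt(x)): try to set each result bit from the top down and keep
   it whenever the cube of the candidate does not exceed x. */
uint64_t icbrt_bitwise(uint64_t x)
{
    uint64_t result = 0;
    for (int i = 21; i >= 0; i--) {
        uint64_t cand = result | ((uint64_t)1 << i);
        /* cand > 2642245 would overflow when cubed, and is too large anyway. */
        if (cand <= 2642245 && cand * cand * cand <= x)
            result = cand;
    }
    return result;
}
```

This always performs 22 iterations with two multiplications each, so it is mainly useful as a simple reference to test faster versions against.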

You can try and adapt this C algorithm :
#include <limits.h>
// return a number that, when multiplied by itself twice, makes N.
unsigned cube_root(unsigned n){
unsigned a = 0, b;
for (int c = sizeof(unsigned) * CHAR_BIT / 3 * 3 ; c >= 0; c -= 3) {
a <<= 1;
b = a + (a << 1), b = b * a + b + 1 ;
if (n >> c >= b)
n -= b << c, ++a;
}
return a;
}
Also there is :
// return the number that was multiplied by itself to reach N.
unsigned square_root(const unsigned num) {
unsigned a, b, c, d;
for (b = a = num, c = 1; a >>= 1; ++c);
for (c = 1 << (c & -2); c; c >>= 2) {
d = a + c;
a >>= 1;
if (b >= d)
b -= d, a += c;
}
return a;
}
Source

Related

Divide 64-bit integers as though the dividend is shifted left 64 bits, without having 128-bit types

Apologies for the confusing title. I'm not sure how to better describe what I'm trying to accomplish. I'm essentially trying to do the reverse of
getting the high half of a 64-bit multiplication in C for platforms where
int64_t divHi64(int64_t dividend, int64_t divisor) {
return ((__int128)dividend << 64) / (__int128)divisor;
}
isn't possible due to lacking support for __int128.
This can be done without a multi-word division
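Suppose we want to do ⌊2^64 × x / y⌋; then we can transform the expression like this:
⌊2^64 × x / y⌋ = ⌊(⌊2^64 / y⌋ × y + 2^64 % y) × x / y⌋ = ⌊2^64 / y⌋ × x + ⌊(2^64 % y) × x / y⌋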
The first term is trivially done as ((-y)/y + 1)*x as per this question How to compute 2⁶⁴/n in C?
The second term is equivalent to (2^64 % y)/y*x and is a little bit trickier. I've tried various ways, but all need 128-bit multiplication and 128/64 division if using only integer operations. That can be done using the algorithms to calculate MulDiv64(a, b, c) = a*b/c in the questions below:
Most accurate way to do a combined multiply-and-divide operation in 64-bit?
How to multiply a 64 bit integer by a fraction in C++ while minimizing error?
(a * b) / c MulDiv and dealing with overflow from intermediate multiplication
How can I multiply and divide 64-bit ints accurately?
However, they may be slow, and if you have those functions you can calculate the whole expression more easily as MulDiv64(x, UINT64_MAX, y) + x/y + something, without bothering with the above transformation.
Using long double seems to be the easiest way if it has 64 bits of precision or more. So now it can be done by (2^64 % y)/(long double)y*x
uint64_t divHi64(uint64_t x, uint64_t y) {
uint64_t mod_y = UINT64_MAX % y + 1;
uint64_t result = ((-y)/y + 1)*x;
if (mod_y != y)
result += (uint64_t)((mod_y/(long double)y)*x);
return result;
}
The overflow check was omitted for simplification. A slight modification will be needed if you need signed division
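If floating point is not an option at all, a plain shift-and-subtract long division also works with nothing wider than 64 bits. This is a different technique from the long double approach above: slower (64 iterations) but exact. The sketch assumes x < y, so that the quotient fits in 64 bits, and the name divhi64_shift is invented for this example.

```c
#include <stdint.h>

/* floor(2^64 * x / y) by binary long division.
   Assumes x < y (which also implies y != 0), so the result fits in 64 bits. */
uint64_t divhi64_shift(uint64_t x, uint64_t y)
{
    uint64_t q = 0;
    uint64_t r = x;                  /* running remainder, always < y */
    for (int i = 0; i < 64; i++) {
        uint64_t carry = r >> 63;    /* bit that the shift below pushes out */
        r <<= 1;
        q <<= 1;
        if (carry || r >= y) {       /* true value of 2r is at least y */
            r -= y;                  /* wraps to the correct value if carry was set */
            q |= 1;
        }
    }
    return q;
}
```

For example, divhi64_shift(1, 3) yields 0x5555555555555555, the 64-bit fixed-point fraction for 1/3.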
If you're targeting 64-bit Windows with MSVC, which doesn't have __int128, it now has a 128-bit/64-bit divide intrinsic (_udiv128) which simplifies the job significantly without a 128-bit integer type. You still need to handle overflow, though, because the div instruction raises an exception in that case
uint64_t divHi64(uint64_t x, uint64_t y) {
uint64_t high, remainder;
uint64_t low = _umul128(UINT64_MAX, y, &high);
if (x <= high /* && 0 <= low */)
return _udiv128(x, 0, y, &remainder);
// overflow case
errno = EOVERFLOW;
return 0;
}
The overflow check above can be simplified to checking whether x < y, because if x >= y the result will overflow
See also
Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit)
Efficient computation of 2**64 / divisor via fast floating-point reciprocal
Exhaustive tests on 16/16 bit division show that my solution works correctly for all cases. However, you do need double even though float has more than 16 bits of precision; otherwise a result that is one less than the correct value will occasionally be returned. It may be fixed by adding an epsilon value before truncating: (uint64_t)((mod_y/(long double)y)*x + epsilon). That means you'll need __float128 (or the -m128bit-long-double option) in gcc for precise 64/64-bit output if you don't correct the result with epsilon. However, that type is available on 32-bit targets, unlike __int128 which is supported only on 64-bit targets, so life will be a bit easier. Of course you can use the function as-is if just a very close result is needed
Below is the code I've used for verifying
#include <thread>
#include <iostream>
#include <limits>
#include <climits>
#include <mutex>
std::mutex print_mutex;
#define MAX_THREAD 8
#define NUM_BITS 27
#define CHUNK_SIZE (1ULL << NUM_BITS)
// typedef uint32_t T;
// typedef uint64_t T2;
// typedef double D;
typedef uint64_t T;
typedef unsigned __int128 T2; // the type twice as wide as T
typedef long double D;
// typedef __float128 D;
const D epsilon = 1e-14;
T divHi(T x, T y) {
T mod_y = std::numeric_limits<T>::max() % y + 1;
T result = ((-y)/y + 1)*x;
if (mod_y != y)
result += (T)((mod_y/(D)y)*x + epsilon);
return result;
}
void testdiv(T midpoint)
{
T begin = midpoint - CHUNK_SIZE/2;
T end = midpoint + CHUNK_SIZE/2;
for (T i = begin; i != end; i++)
{
T x = i & ((1 << NUM_BITS/2) - 1);
T y = CHUNK_SIZE/2 - (i >> NUM_BITS/2);
// if (y == 0)
// continue;
auto q1 = divHi(x, y);
T2 q2 = ((T2)x << sizeof(T)*CHAR_BIT)/y;
if (q2 != (T)q2)
{
// std::lock_guard<std::mutex> guard(print_mutex);
// std::cout << "Overflowed: " << x << '&' << y << '\n';
continue;
}
else if (q1 != q2)
{
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << x << '/' << y << ": " << q1 << " != " << (T)q2 << '\n';
}
}
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << "Done testing [" << begin << ", " << end << "]\n";
}
uint16_t divHi16(uint32_t x, uint32_t y) {
uint32_t mod_y = std::numeric_limits<uint16_t>::max() % y + 1;
int result = ((((1U << 16) - y)/y) + 1)*x;
if (mod_y != y)
result += (mod_y/(double)y)*x;
return result;
}
void testdiv16(uint32_t begin, uint32_t end)
{
for (uint32_t i = begin; i != end; i++)
{
uint32_t y = i & 0xFFFF;
if (y == 0)
continue;
uint32_t x = i & 0xFFFF0000;
uint32_t q2 = x/y;
if (q2 > 0xFFFF) // overflowed
continue;
uint16_t q1 = divHi16(x >> 16, y);
if (q1 != q2)
{
std::lock_guard<std::mutex> guard(print_mutex);
std::cout << x << '/' << y << ": " << q1 << " != " << q2 << '\n';
}
}
}
int main()
{
std::thread t[MAX_THREAD];
for (int i = 0; i < MAX_THREAD; i++)
t[i] = std::thread(testdiv, std::numeric_limits<T>::max()/MAX_THREAD*i);
for (int i = 0; i < MAX_THREAD; i++)
t[i].join();
std::thread t2[MAX_THREAD];
constexpr uint32_t length = std::numeric_limits<uint32_t>::max()/MAX_THREAD;
uint32_t begin, end = length;
for (int i = 0; i < MAX_THREAD - 1; i++)
{
begin = end;
end += length;
t2[i] = std::thread(testdiv16, begin, end);
}
t2[MAX_THREAD - 1] = std::thread(testdiv16, end, UINT32_MAX); // testdiv takes one argument; the 16-bit range test is testdiv16
for (int i = 0; i < MAX_THREAD; i++)
t2[i].join();
std::cout << "Done\n";
}

64 bit / 64 bit remainder finding algorithm on a 32 bit processor?

I know that similar questions have been asked in the past, but after a long process I have implemented the algorithm to find the quotient correctly using the division by repeated subtraction method. However, I am not able to find the remainder with this approach. Is there any quick and easy way to find the remainder of a 64bit/64bit division on a 32bit processor? To be more precise, I am trying to implement
ulldiv_t __aeabi_uldivmod(
unsigned long long n, unsigned long long d)
Referenced in this document http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf
What? If you do repeated subtraction (which sounds really basic), then isn't it as simple as whatever you have left when you can't do another subtraction is the remainder?
At least that's the naïve intuitive way:
uint64_t simple_divmod(uint64_t n, uint64_t d)
{
if (n == 0 || d == 0)
return 0;
uint64_t q = 0;
while (n >= d)
{
++q;
n -= d;
}
return n;
}
Or am I missing the boat, here?
Of course this will be fantastically slow for large numbers, but this is repeated subtraction. I'm sure (even without looking!) there are more advanced algorithms.
This is a division algorithm that runs in O(log(n/d)) iterations:
uint64_t slow_division(uint64_t n, uint64_t d)
{
uint64_t i = d;
uint64_t q = 0;
uint64_t r = n;
while (n > i && (i >> 63) == 0) i <<= 1;
while (i >= d) {
q <<= 1;
if (r >= i) { r -= i; q += 1; }
i >>= 1;
}
// quotient is q, remainder is r
return q; // return r
}
q (quotient) can be removed if you need only r (remainder). You can implement each of the intermediate variables i,q,r as a pair of uint32_t, e.g. i_lo, i_hi, q_lo, q_hi ..... shift, add and subtract lo and hi are simple operations.
#define left_shift1(a_hi, a_lo) /* a <<= 1 */ \
do { \
(a_hi) = ((a_hi) << 1) | ((a_lo) >> 31); \
(a_lo) = ((a_lo) << 1); \
} while (0)
#define subtraction(a_hi, a_lo, b_hi, b_lo) /* a -= b */ \
do { \
uint32_t t = (a_lo); \
(a_lo) -= (b_lo); \
t = ((a_lo) > t); /* borrow */ \
(a_hi) -= (b_hi) + t; \
} while (0)
#define right_shift63(a_hi, a_lo) /* a >>= 63 */ \
do { \
(a_lo) = (a_hi) >> 31; \
(a_hi) = 0; \
} while (0)
and so on.
0 as divisor is still an unresolved challenge :-) .
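For reference, here is a sketch of the classic bit-at-a-time long division that produces the quotient and the remainder in one pass. It uses only 64-bit shifts, compares and subtractions, which a 32-bit compiler expands inline into operations on register pairs without calling the division helper. The struct udiv64_t and the function udivmod64 are illustrative names, not the ARM RTABI ones, and returning quot = 0, rem = n for a zero divisor is just a placeholder policy.

```c
#include <stdint.h>

typedef struct {
    uint64_t quot;
    uint64_t rem;
} udiv64_t;                          /* hypothetical stand-in for ulldiv_t */

udiv64_t udivmod64(uint64_t n, uint64_t d)
{
    udiv64_t res = { 0, 0 };
    if (d == 0) {                    /* placeholder policy for division by zero */
        res.rem = n;
        return res;
    }
    for (int i = 63; i >= 0; i--) {
        /* rem is the remainder of the top 63 - i bits of n, so it is below
           2^63 and the shift cannot overflow. */
        res.rem = (res.rem << 1) | ((n >> i) & 1);
        if (res.rem >= d) {
            res.rem -= d;
            res.quot |= (uint64_t)1 << i;
        }
    }
    return res;
}
```

The loop always runs 64 times; starting at the highest set bit of n instead would shorten it for small dividends.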

Fast modular multiplication modulo prime for linear congruential generator in C

I am trying to implement a random-number generator with Mersenne prime (2^31-1) as the modulus. The following working code was based on several related posts:
How do I extract specific 'n' bits of a 32-bit unsigned integer in C?
Fast multiplication and subtraction modulo a prime
Fast multiplication modulo 2^16 + 1
However,
It does not work with uint32_t hi, lo;, which means I do not understand signed vs. unsigned aspect of the problem.
Based on #2 above, I was expecting the answer to be (hi+lo). Which means, I do not understand why the following statement is needed.
if (x1 > r)
x1 += r + 2;
Can someone please clarify the source of my confusion?
Can the code itself be improved?
Should the generator avoid 0 or 2^31-1 as a seed?
How would the code change for a prime (2^p-k)?
Original code
#include <inttypes.h>
// x1 = a*x0 (mod 2^31-1)
int32_t lgc_m(int32_t a, int32_t x)
{
printf("x %"PRId32"\n", x);
if (x == 2147483647){
printf("x1 %"PRId64"\n", 0);
return (0);
}
uint64_t c, r = 1;
c = (uint64_t)a * (uint64_t)x;
if (c < 2147483647){
printf("x1 %"PRId64"\n", c);
return (c);
}
int32_t hi=0, lo=0;
int i, p = 31;//2^31-1
for (i = 1; i < p; ++i){
r |= 1 << i;
}
lo = (c & r) ;
hi = (c & ~r) >> p;
uint64_t x1 = (uint64_t ) (hi + lo);
// NOT SURE ABOUT THE NEXT STATEMENT
if (x1 > r)
x1 += r + 2;
printf("c %"PRId64"\n", c);
printf("r %"PRId64"\n", r);
printf("\tlo %"PRId32"\n", lo);
printf("\thi %"PRId32"\n", hi);
printf("x1 %"PRId64"\n", x1);
printf("\n" );
return((int32_t) x1);
}
int main(void)
{
int32_t r;
r = lgc_m(1583458089, 1);
r = lgc_m(1583458089, 2000000000);
r = lgc_m(1583458089, 2147483646);
r = lgc_m(1583458089, 2147483647);
return(0);
}
The following if statement
if (x1 > r)
x1 += r + 2;
should be written as
if (x1 > r)
x1 -= r;
Both results are the same modulo 2^32:
x1 + r + 2 = x1 + 2^31 - 1 + 2 = x1 + 2^31 + 1
x1 - r = x1 - (2^31 - 1) = x1 - 2^31 + 1
The first solution overflows an int32_t and assumes that conversion from uint64_t to int32_t is modulo 2^32. While many C compilers handle the conversion this way, this is not mandated by the C standard. The actual result is implementation-defined.
The second solution avoids the overflow and works with both int32_t and uint32_t.
You can also use an integer constant for r:
uint64_t r = 0x7FFFFFFF; // 2^31 - 1
Or simply
uint64_t r = INT32_MAX;
EDIT: For primes of the form 2^p-k, you have to use masks with p bits and calculate the result with
uint32_t x1 = (k * hi + lo) % ((1 << p) - k)
If k * hi + lo can overflow a uint32_t (that is (k + 1) * (2^p - 1) >= 2^32), you have to use 64-bit arithmetic:
uint32_t x1 = ((uint64_t)a * x) % ((1 << p) - k)
Depending on the platform, the latter might be faster anyway.
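For the Mersenne case (k = 1), the whole reduction can be written without any division. The following is a sketch rather than a drop-in replacement for the code in the question; the function name mulmod_m31 is made up. It relies on 2^31 being congruent to 1 modulo 2^31 - 1, so the high part of the product can simply be added to the low part.

```c
#include <stdint.h>

/* (a * x) mod (2^31 - 1) using shifts, masks and adds only. */
uint32_t mulmod_m31(uint32_t a, uint32_t x)
{
    const uint64_t m = 0x7FFFFFFF;           /* 2^31 - 1 */
    uint64_t c = (uint64_t)a * x;            /* up to 64 bits */
    uint64_t s = (c & m) + (c >> 31);        /* first fold: hi + lo */
    s = (s & m) + (s >> 31);                 /* second fold: now s <= m + 4 */
    if (s >= m)                              /* final correction; maps m to 0 */
        s -= m;
    return (uint32_t)s;
}
```

The second fold and the final comparison are what the `if (x1 > r)` statement in the question was standing in for: a single hi + lo addition can leave a value that is still at least as large as the modulus.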
Sue provided this as a solution:
With some experimentation (new code at the bottom), I was able to use
uint32_t, which further suggests that I do not understand how the
signed integers work with bit operations.
The following code uses uint32_t for input as well as hi and lo.
#include <inttypes.h>
// x1 = a*x0 (mod 2^31-1)
uint32_t lgc_m(uint32_t a, uint32_t x)
{
printf("x %"PRId32"\n", x);
if (x == 2147483647){
printf("x1 %"PRId64"\n", 0);
return (0);
}
uint64_t c, r = 1;
c = (uint64_t)a * (uint64_t)x;
if (c < 2147483647){
printf("x1 %"PRId64"\n", c);
return (c);
}
uint32_t hi=0, lo=0;
int i, p = 31;//2^31-1
for (i = 1; i < p; ++i){
r |= 1 << i;
}
hi = c >> p;
lo = (c & r) ;
uint64_t x1 = (uint64_t ) ((hi + lo) );
// NOT SURE ABOUT THE NEXT STATEMENT
if (x1 > r){
printf("x1 - r = %"PRId64"\n", x1- r);
x1 -= r;
}
printf("c %"PRId64"\n", c);
printf("r %"PRId64"\n", r);
printf("\tlo %"PRId32"\n", lo);
printf("\thi %"PRId32"\n", hi);
printf("x1 %"PRId64"\n", x1);
printf("\n" );
return((uint32_t) x1);
}
int main(void)
{
uint32_t r;
r = lgc_m(1583458089, 1583458089);
r = lgc_m(1583458089, 2147483645);
return(0);
}
The issue was my assumption that the reduction would be complete
after one pass. If (x > 2^31-1), then by definition the
reduction has not occurred and a second pass is necessary. Subtracting
2^31-1 in that case does the trick. In the second attempt
above, r = 2^31-1 and is therefore the modulus. x -= r achieves
the final reduction.
Perhaps someone with expertise in random numbers or modular reduction
could explain it better.
Cleaned function without printf()s.
uint32_t lgc_m(uint32_t a, uint32_t x){
uint64_t c, x1, m = 2147483647; //modulus: m = 2^31-1
if (x == m)
return (0);
c = (uint64_t)a * (uint64_t)x;
if (c < m)//no reduction necessary
return (c);
uint32_t hi, lo, p = 31;//2^p-1, p = 31
hi = c >> p;
lo = c & m;
x1 = (uint64_t)(hi + lo);
if (x1 > m){//one more pass needed
//this block can be replaced by x1 -= m;
hi = x1 >> p;
lo = (x1 & m);
x1 = (uint64_t)(hi + lo);
}
return((uint32_t) x1);
}

Floating point emulation or Fixed Point for numbers in a given range

I have a co-processor which does not have floating point support. I tried to use 32 bit fix point, but it is unable to work on very small numbers. My numbers range from 1 to 1e-18. One way is to use floating point emulation, but it is too slow. Can we make it faster in this case where we know the numbers won't be greater than 1 and smaller than 1e-18. Or is there a way to make fix point work on very small numbers.
It is not possible for a 32-bit fixed-point encoding to represent numbers from 10^-18 to 1. This is immediately obvious from the fact that the span from 10^-18 to 1 is a ratio of 10^18, but the non-zero encodings of a 32-bit integer span a ratio of less than 2^32, which is much less than 10^18. Therefore, no choice of scale for the fixed-point encoding will provide the desired span.
So a 32-bit fixed-point encoding will not work, and you must use some other technique.
In some applications, it may be suitable to use multiple fixed-point encodings. That is, various input values would be encoded with a fixed-point encoding but each with a scale suitable to it, and intermediate values and the outputs would also have customized scales. Obviously, this is possible only if suitable scales can be determined at design time. Otherwise, you should abandon 32-bit fixed-point encodings and consider alternatives.
Will simplified 24-bit floating point be fast enough and accurate enough?:
#include <stdio.h>
#include <limits.h>
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned myfloat;
#else
typedef unsigned long myfloat;
#endif
#define MF_EXP_BIAS 0x80
myfloat mfadd(myfloat a, myfloat b)
{
unsigned ea = a >> 16, eb = b >> 16;
if (ea > eb)
{
a &= 0xFFFF;
b = (b & 0xFFFF) >> (ea - eb);
if ((a += b) > 0xFFFF)
a >>= 1, ++ea;
return a | ((myfloat)ea << 16);
}
else if (eb > ea)
{
b &= 0xFFFF;
a = (a & 0xFFFF) >> (eb - ea);
if ((b += a) > 0xFFFF)
b >>= 1, ++eb;
return b | ((myfloat)eb << 16);
}
else
{
return (((a & 0xFFFF) + (b & 0xFFFF)) >> 1) | ((myfloat)++ea << 16);
}
}
myfloat mfmul(myfloat a, myfloat b)
{
unsigned ea = a >> 16, eb = b >> 16, e = ea + eb - MF_EXP_BIAS;
myfloat p = ((a & 0xFFFF) * (b & 0xFFFF)) >> 16;
return p | ((myfloat)e << 16);
}
myfloat double2mf(double x)
{
myfloat f;
unsigned e = MF_EXP_BIAS + 16;
if (x <= 0)
return 0;
while (x < 0x8000)
x *= 2, --e;
while (x >= 0x10000)
x /= 2, ++e;
f = x;
return f | ((myfloat)e << 16);
}
double mf2double(myfloat f)
{
double x;
unsigned e = (f >> 16) - 16;
if ((f & 0xFFFF) == 0)
return 0;
x = f & 0xFFFF;
while (e > MF_EXP_BIAS)
x *= 2, --e;
while (e < MF_EXP_BIAS)
x /= 2, ++e;
return x;
}
int main(void)
{
double testConvData[] = { 1e-18, .25, 0.3333333, .5, 1, 2, 3.141593, 1e18 };
unsigned i;
for (i = 0; i < sizeof(testConvData) / sizeof(testConvData[0]); i++)
printf("%e -> 0x%06lX -> %e\n",
testConvData[i],
(unsigned long)double2mf(testConvData[i]),
mf2double(double2mf(testConvData[i])));
printf("300 * 5 = %e\n", mf2double(mfmul(double2mf(300),double2mf(5))));
printf("500 + 3 = %e\n", mf2double(mfadd(double2mf(500),double2mf(3))));
printf("1e18 * 1e-18 = %e\n", mf2double(mfmul(double2mf(1e18),double2mf(1e-18))));
printf("1e-18 + 2e-18 = %e\n", mf2double(mfadd(double2mf(1e-18),double2mf(2e-18))));
printf("1e-16 + 1e-18 = %e\n", mf2double(mfadd(double2mf(1e-16),double2mf(1e-18))));
return 0;
}
Output (ideone):
1.000000e-18 -> 0x459392 -> 9.999753e-19
2.500000e-01 -> 0x7F8000 -> 2.500000e-01
3.333333e-01 -> 0x7FAAAA -> 3.333282e-01
5.000000e-01 -> 0x808000 -> 5.000000e-01
1.000000e+00 -> 0x818000 -> 1.000000e+00
2.000000e+00 -> 0x828000 -> 2.000000e+00
3.141593e+00 -> 0x82C90F -> 3.141541e+00
1.000000e+18 -> 0xBCDE0B -> 9.999926e+17
300 * 5 = 1.500000e+03
500 + 3 = 5.030000e+02
1e18 * 1e-18 = 9.999390e-01
1e-18 + 2e-18 = 2.999926e-18
1e-16 + 1e-18 = 1.009985e-16
Subtraction is left as an exercise. Ditto for better conversion routines.
Use 64 bit fixed point and be done with it.
Compared with 32 bit fixed point it will be four times slower for multiplication, but it will still be far more efficient than float emulation.
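As a rough sketch of what that could look like, assume a Q2.62 layout (2 integer bits, 62 fraction bits) and a target that has no 128-bit type. The type and function names below are invented for the example. The multiply builds the 128-bit product from four 32x32 multiplies, which is where the factor of four comes from.

```c
#include <stdint.h>

/* Q2.62: value = raw / 2^62, so 1.0 is 1 << 62 and one unit is about 2.2e-19. */
typedef uint64_t q2_62;
#define Q2_62_ONE ((q2_62)1 << 62)

/* Full 64x64 -> 128-bit product from four 32x32 -> 64-bit multiplies. */
void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t al = a & 0xFFFFFFFFu, ah = a >> 32;
    uint64_t bl = b & 0xFFFFFFFFu, bh = b >> 32;
    uint64_t p0 = al * bl;
    uint64_t mid = al * bh + (p0 >> 32);            /* cannot overflow */
    uint64_t mid2 = ah * bl + (mid & 0xFFFFFFFFu);  /* cannot overflow */
    *lo = (mid2 << 32) | (p0 & 0xFFFFFFFFu);
    *hi = ah * bh + (mid >> 32) + (mid2 >> 32);
}

/* Truncating product; the caller must ensure the true result is below 4.0. */
q2_62 q2_62_mul(q2_62 a, q2_62 b)
{
    uint64_t hi, lo;
    mul64x64(a, b, &hi, &lo);
    return (hi << 2) | (lo >> 62);                  /* (a * b) >> 62 */
}
```

Note that at the bottom of the range the precision is poor: 1e-18 is only about 4.6 units of 2^-62, so values that small carry barely two significant bits.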
In embedded systems I'd suggest using 16+32, 16+16, 8+16 or 8+24 bit redundant floating point representation, where each number is simply M * 2^exp.
In this case you can choose to represent zero with both M=0 and exp=0; There are 16-32 representations for each power of 2 -- and that mainly makes comparison a bit harder than typically. Also one can postpone normalization e.g. after subtraction.

how to calculate (a times b) divided by c only using 32-bit integer types even if a times b would not fit such a type

Consider the following as a reference implementation:
/* calculates (a * b) / c */
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c)
{
uint64_t x = a;
x = x * b;
x = x / c;
return x;
}
I am interested in an implementation (in C or pseudocode) that does not require a 64-bit integer type.
I started sketching an implementation that outlines like this:
/* calculates (a * b) / c */
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c)
{
uint32_t d1, d2, d1d2;
d1 = (1 << 10);
d2 = (1 << 10);
d1d2 = (1 << 20); /* d1 * d2 */
return ((a / d1) * (b /d2)) / (c / d1d2);
}
But the difficulty is to pick values for d1 and d2 that manage to avoid the overflow ((a / d1) * (b / d2) <= UINT32_MAX) and minimize the error of the whole calculation.
Any thoughts?
I have adapted the algorithm posted by Paul for unsigned ints (by omitting the parts that are dealing with signs). The algorithm is basically Ancient Egyptian multiplication of a with the fraction floor(b/c) + (b%c)/c (with the slash denoting real division here).
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c)
{
uint32_t q = 0; // the quotient
uint32_t r = 0; // the remainder
uint32_t qn = b / c;
uint32_t rn = b % c;
while(a)
{
if (a & 1)
{
q += qn;
r += rn;
if (r >= c)
{
q++;
r -= c;
}
}
a >>= 1;
qn <<= 1;
rn <<= 1;
if (rn >= c)
{
qn++;
rn -= c;
}
}
return q;
}
This algorithm will yield the exact answer as long as it fits in 32 bits. You can optionally also return the remainder r.
The simplest way would be converting the intermediate result to 64 bits, but, depending on the value of c, you could use another approach:
(a/c)*b + (a%c)*(b/c) + ((a%c)*(b%c))/c
The only problem is that the last term could still overflow for large values of c. still thinking about it..
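Written out as a function, the decomposition looks like the sketch below. The name muldiv_split is made up. It is exact only under two assumptions: (a % c) * (b % c) fits in 32 bits, which is guaranteed when c <= 65536, and the final result itself fits in 32 bits. As usual, c must not be 0.

```c
#include <stdint.h>

/* (a * b) / c via a = qa*c + ra and b = qb*c + rb:
   a*b/c = qa*b + ra*qb + (ra*rb)/c.
   Exact if ra * rb < 2^32 (e.g. c <= 65536) and the result fits in 32 bits. */
uint32_t muldiv_split(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t qa = a / c, ra = a % c;
    uint32_t qb = b / c, rb = b % c;
    return qa * b + ra * qb + (ra * rb) / c;
}
```

For example, muldiv_split(4000000000u, 3, 4) gives 3000000000 even though 4000000000 * 3 does not fit in 32 bits.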
You can first divide a by c and also get the remainder of the division, then multiply the remainder by b before dividing it by c. That way you only lose data in the last division, and you get the same result as with the 64-bit division.
You can rewrite the formula like this (where \ is integer division):
a * b / c =
(a / c) * b =
(a \ c + (a % c) / c) * b =
(a \ c) * b + ((a % c) * b) / c
By making sure that a >= b, you can use larger values before they overflow:
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c) {
uint32_t hi = a > b ? a : b;
uint32_t lo = a > b ? b : a;
return (hi / c) * lo + (hi % c) * lo / c;
}
Another approach would be to loop addition and subtraction instead of multiplying and dividing, but that is of course a lot more work:
uint32_t muldiv(uint32_t a, uint32_t b, uint32_t c) {
uint32_t hi = a > b ? a : b;
uint32_t lo = a > b ? b : a;
uint32_t sum = 0;
uint32_t cnt = 0;
for (uint32_t i = 0; i < hi; i++) {
sum += lo;
while (sum >= c) {
sum -= c;
cnt++;
}
}
return cnt;
}
Searching on www.google.com/codesearch turns up a number of implementations, including this wonderfully obvious one. I particularly like the extensive comments and well chosen variable names
INT32 muldiv(INT32 a, INT32 b, INT32 c)
{ INT32 q=0, r=0, qn, rn;
int qneg=0, rneg=0;
if (c==0) c=1;
if (a<0) { qneg=!qneg; rneg=!rneg; a = -a; }
if (b<0) { qneg=!qneg; rneg=!rneg; b = -b; }
if (c<0) { qneg=!qneg; c = -c; }
qn = b / c;
rn = b % c;
while(a)
{ if (a&1) { q += qn;
r += rn;
if(r>=c) { q++; r -= c; }
}
a >>= 1;
qn <<= 1;
rn <<= 1;
if (rn>=c) {qn++; rn -= c; }
}
result2 = rneg ? -r : r;
return qneg ? -q : q;
}
http://www.google.com/codesearch/p?hl=en#HTrPUplLEaU/users/mr/MCPL/mcpl.tgz|gIE-sNMlwIs/MCPL/mintcode/sysc/mintsys.c&q=muldiv%20lang:c
I implemented the Sven's code as UINT16, to intensively test it:
uint16_t muldiv16(uint16_t a, uint16_t b, uint16_t c);
int main(int argc, char *argv[]){
uint32_t a;
uint32_t b;
uint32_t c;
uint16_t r1, r2;
// ~167 days, estimated on i7 6700k, single thread.
// Split the 'a' range, to run several instances of this code on multi-cores processor
// ~1s, with an UINT8 implementation
for(a=0; a<=UINT16_MAX; a++){
for(b=0; b<=UINT16_MAX; b++){
for(c=1; c<=UINT16_MAX; c++){
r1 = uint16_t( a*b/c );
r2 = muldiv16(uint16_t(a), uint16_t(b), uint16_t(c));
if( r1 != r2 ){
std::cout << "Err: " << a << " * " << b << " / " << c << ", result: " << r2 << ", expected: " << r1 << std::endl;
return -1;
}
}
}
std::cout << a << std::endl;
}
std::cout << "Done." << std::endl;
return 0;
}
Unfortunately, it seems to be limited to 31 bits for 'b' (0-2147483647).
Here is my correction, which seems to work (the UINT16 test has not run to completion, but it has run for a long time without errors; the UINT8 test completed).
uint32_t muldiv32(uint32_t a, uint32_t b, uint32_t c)
{
uint32_t q = 0; // the quotient
uint32_t r = 0; // the remainder
uint32_t qn = b / c;
uint32_t rn = b % c;
uint32_t r_carry;
uint32_t rn_carry;
while(a)
{
if (a & 1)
{
q += qn;
r_carry = (r > UINT32_MAX-rn);
r += rn;
if (r >= c || r_carry)
{
q++;
r -= c;
}
}
a >>= 1;
qn <<= 1;
rn_carry = rn & 0x80000000UL;
rn <<= 1;
if (rn >= c || rn_carry)
{
qn++;
rn -= c;
}
}
return q;
}
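The carry handling above can be checked exhaustively at a smaller width, as the UINT8 test mentioned earlier suggests. This is a sketch under assumptions: `muldiv8` is a hypothetical 8-bit transcription of the corrected routine, not the author's actual test code, and `muldiv8_mismatches` compares it against plain wide arithmetic, truncated to 8 bits, for every input with c != 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-bit transcription of the corrected shift-and-add routine. */
uint8_t muldiv8(uint8_t a, uint8_t b, uint8_t c)
{
    assert(c != 0);
    uint8_t q = 0, r = 0;              /* running quotient and remainder */
    uint8_t qn = b / c, rn = b % c;    /* quotient/remainder of b * 2^k  */
    while (a) {
        if (a & 1) {
            q = (uint8_t)(q + qn);
            uint8_t r_carry = (r > UINT8_MAX - rn);  /* r + rn would wrap */
            r = (uint8_t)(r + rn);
            if (r >= c || r_carry) { q++; r = (uint8_t)(r - c); }
        }
        a >>= 1;
        qn = (uint8_t)(qn << 1);
        uint8_t rn_carry = rn & 0x80;                /* doubling rn would wrap */
        rn = (uint8_t)(rn << 1);
        if (rn >= c || rn_carry) { qn++; rn = (uint8_t)(rn - c); }
    }
    return q;
}

/* Counts inputs where muldiv8 differs from (a * b / c) truncated to 8 bits. */
unsigned long muldiv8_mismatches(void)
{
    unsigned long bad = 0;
    for (unsigned a = 0; a <= UINT8_MAX; a++)
        for (unsigned b = 0; b <= UINT8_MAX; b++)
            for (unsigned c = 1; c <= UINT8_MAX; c++)
                if (muldiv8((uint8_t)a, (uint8_t)b, (uint8_t)c) != (uint8_t)(a * b / c))
                    bad++;
    return bad;
}
```

The subtraction `r - c` after a carry is deliberately allowed to wrap: the true sum is below 2*c, so the wrapped difference is the correct remainder.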
Edit: an improved version that returns the remainder, handles rounding, reports overflow and, of course, handles the full UINT32 range for a, b and c:
typedef enum{
ROUND_DOWNWARD=0,
ROUND_TONEAREST,
ROUND_UPWARD
}ROUND;
//remainder is never negative for ROUND_DOWNWARD ( a * b = c * q + remainder )
//the true remainder is never positive for ROUND_UPWARD; its magnitude is returned ( a * b = c * q - remainder )
//remainder is signed for ROUND_TONEAREST ( a * b = c * q + int32_t(remainder) )
uint32_t muldiv32(uint32_t a, uint32_t b, uint32_t c, uint32_t *remainder, ROUND round, uint8_t *ovf)
{
uint32_t q = 0; // the quotient
uint32_t r = 0; // the remainder
uint32_t qn = b / c;
uint32_t rn = b % c;
uint32_t r_carry;
uint32_t rn_carry;
uint8_t o = 0;
uint8_t rup;
while(a)
{
if (a & 1)
{
o |= (q > UINT32_MAX-qn);
q += qn;
r_carry = (r > UINT32_MAX-rn);
r += rn;
if (r >= c || r_carry)
{
o |= (q == UINT32_MAX);
q++;
r -= c;
}
}
a >>= 1;
qn <<= 1;
rn_carry = rn & 0x80000000;
rn <<= 1;
if (rn >= c || rn_carry)
{
qn++;
rn -= c;
}
}
rup = (round == ROUND_UPWARD && r);
rup |= (round == ROUND_TONEAREST && ((r<<1) >= c || r & 0x80000000));
if(rup)
{ //round
o |= (q == UINT32_MAX);
q++;
r = (round == ROUND_UPWARD) ? c-r : r-c;
}
if(remainder)
*remainder = r;
if(ovf)
*ovf = o;
return q;
}
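To make the three remainder conventions concrete, here is a small reference sketch that mirrors the interface above using 64-bit arithmetic. It is meant as a test oracle on a host with `uint64_t`, not as a replacement on the MCU. The names `muldiv32_ref`, `muldiv32_ref_rem` and the `REF_*` enumerators are hypothetical and were chosen so that they do not clash with the code above.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { REF_DOWNWARD = 0, REF_TONEAREST, REF_UPWARD } REF_ROUND;

/* Oracle with the same outputs as the routine above (assumes uint64_t). */
uint32_t muldiv32_ref(uint32_t a, uint32_t b, uint32_t c,
                      uint32_t *remainder, REF_ROUND round, uint8_t *ovf)
{
    assert(c != 0);
    uint64_t p = (uint64_t)a * b;
    uint64_t q = p / c;
    uint32_t r = (uint32_t)(p % c);

    int round_up = (round == REF_UPWARD && r != 0) ||
                   (round == REF_TONEAREST && 2 * (uint64_t)r >= c);
    if (round_up) {
        q++;
        /* UPWARD stores the magnitude; TONEAREST stores r - c, which wraps
           and reads as a negative value when viewed as int32_t. */
        r = (round == REF_UPWARD) ? c - r : r - c;
    }
    if (remainder)
        *remainder = r;
    if (ovf)
        *ovf = (q > UINT32_MAX);   /* quotient does not fit in 32 bits */
    return (uint32_t)q;
}

/* Convenience wrapper that returns only the remainder. */
uint32_t muldiv32_ref_rem(uint32_t a, uint32_t b, uint32_t c, REF_ROUND round)
{
    uint32_t rem;
    (void)muldiv32_ref(a, b, c, &rem, round, 0);
    return rem;
}
```

For 7 * 1 / 2 this gives q = 3 with remainder 1 when rounding downward, q = 4 with remainder 1 when rounding upward (7 = 2 * 4 - 1), and q = 4 with remainder 0xFFFFFFFF, that is -1 as a signed value, when rounding to nearest (7 = 2 * 4 + (-1)).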
There may be another approach, perhaps even more efficient:
8-bit, 16-bit and 32-bit MCUs are able to perform 64-bit calculations (long long int).
Does anyone know how the compilers emulate them?
Edit 2:
Here are some interesting timings on an 8-bit MCU:
UINT8 x UINT8 / UINT8: 3.5µs
UINT16 x UINT16 / UINT16: 22.5µs, muldiv8: 29.9 to 45.3µs
UINT32 x UINT32 / UINT32: 84µs, muldiv16: 120 to 189µs
FLOAT32 * FLOAT32 / FLOAT32: 40.2 to 135.5µs, muldiv32: 1.193 to 1.764ms
And on a 32-bit MCU:
Type - optimized code - without optimization
UINT32: 521ns - 604ns
UINT64: 2958ns - 3313ns
FLOAT32: 2563ns - 2688ns
muldiv32: 6791ns - 25375ns
So the compilers are cleverer than this C algorithm.
And, going by these timings, it is faster to work with float variables (even without an FPU) than with integers wider than the native registers (although float32 is less precise than uint32: it cannot represent every integer from 16777217 upward).
Edit3: OK, so my N-bit MCUs use a native N-bit by N-bit MUL instruction that produces a 2N-bit result, stored in two N-bit registers.
A C implementation can be found here (prefer EasyasPi's solution).
But they don't have a native 2N-bit by N-bit DIV instruction. Instead, they use the __udivdi3 function from gcc, with loops and 2N-bit variables (here, UINT64). So this cannot be a solution to the original question.
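For readers without such a MUL instruction, the widening multiply can be sketched in portable C from four 16-bit partial products. This is a generic schoolbook version, not the linked implementation. `mul32_wide` and `mul32_wide_matches` are illustrative names, and the `uint64_t` comparison in the second function is only an assumption for checking on a host.

```c
#include <assert.h>
#include <stdint.h>

/* 32 x 32 -> 64-bit product, returned as two 32-bit halves. */
void mul32_wide(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    assert(hi != 0 && lo != 0);
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;   /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;   /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;   /* contributes to bits 32..63 */

    /* Sum of three 16-bit quantities: at most 3 * 65535, so no overflow. */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFF) + (p2 & 0xFFFF);

    *lo = (mid << 16) | (p0 & 0xFFFF);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}

/* Host-side check against a native 64-bit multiply. */
int mul32_wide_matches(uint32_t a, uint32_t b)
{
    uint32_t hi, lo;
    mul32_wide(a, b, &hi, &lo);
    return (((uint64_t)hi << 32) | lo) == (uint64_t)a * b;
}
```

This only covers the multiplication half of the problem. Dividing the two-word product by c still needs a loop or a library routine, which is the limitation described above.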
If b and c are both constants, you can calculate the result very simply using Egyptian fractions.
For example, y = a * 4 / 99 can be written as
y = a / 25 + a / 2475
You can express any fraction as a sum of Egyptian fractions, as explained in answers to Egyptian Fractions in C.
Having b and c fixed in advance might seem like a bit of a restriction, but this method is a lot simpler than the general case answered by others.
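A small sketch of the 4/99 example, with one caveat: 4/99 = 1/25 + 1/2475 holds exactly for real numbers, but each integer division truncates, so the sum can come out one lower than the exact floor of a * 4 / 99. The function names are illustrative, and the `uint64_t` reference is an assumption used only for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* a * 4 / 99 without any intermediate wider than a itself. */
uint32_t scale_4_99_egyptian(uint32_t a)
{
    return a / 25 + a / 2475;
}

/* Exact floor(a * 4 / 99); assumes a 64-bit type is available. */
uint32_t scale_4_99_ref(uint32_t a)
{
    return (uint32_t)((uint64_t)a * 4 / 99);
}

/* Difference between the exact quotient and the Egyptian-fraction estimate. */
uint32_t scale_4_99_error(uint32_t a)
{
    uint32_t err = scale_4_99_ref(a) - scale_4_99_egyptian(a);
    assert(err <= 1);   /* two truncations lose at most one unit in total */
    return err;
}
```

For a = 99 the exact answer is 4 while the two-term sum gives 3. For a = 2475 both give 100. Whether an error of one unit is acceptable depends on the application.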
Is there a reason you can't simply do
x = a/c;
x = x*b;
and perhaps also add
y = b/c;
y = y*a;
if ( x != y )
return ERROR_VALUE;
Note that, since you're using integer division, a*b/c and a/c*b can give different values, because a/c discards its remainder before the multiplication. In particular, if both a and b are smaller than c, both a/c and b/c are 0 and this approach always returns 0.
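A short sketch of how far apart the orderings can be. `div_then_mul` is an illustrative name for the a/c*b ordering, and `mul_then_div_ref` uses a 64-bit intermediate purely as a comparison point, which is an assumption about the test host.

```c
#include <assert.h>
#include <stdint.h>

/* Divide first: cannot overflow if the true result fits, but loses a % c. */
uint32_t div_then_mul(uint32_t a, uint32_t b, uint32_t c)
{
    assert(c != 0);
    return a / c * b;
}

/* Multiply first with a wide intermediate (comparison point only). */
uint32_t mul_then_div_ref(uint32_t a, uint32_t b, uint32_t c)
{
    assert(c != 0);
    return (uint32_t)((uint64_t)a * b / c);
}
```

With a = 7, b = 3, c = 2 the exact result is 10, while a/c*b gives 9 and b/c*a gives 7. With a = 3, b = 5, c = 7 the exact result is 2, but both divide-first orderings give 0.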

Resources