Assuming that uint is the largest integral type on my fixed-point platform, I have:
uint func(uint a, uint b, uint c);
Which needs to return a good approximation of a * b / c.
The value of c is greater than both the value of a and the value of b.
So we know for sure that the value of a * b / c would fit in a uint.
However, the value of a * b itself overflows the size of a uint.
So one way to compute the value of a * b / c would be:
return a / c * b;
Or even:
if (a > b)
return a / c * b;
return b / c * a;
However, the value of c is greater than both the value of a and the value of b.
So the suggestion above would simply return zero.
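Both failure modes can be made concrete with a minimal sketch. It assumes a 32-bit uint (uint32_t stands in for it here), and the function names divide_first and multiply_first are hypothetical, chosen only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Divide first: a / c is 0 whenever a < c, so the whole product is 0. */
uint32_t divide_first(uint32_t a, uint32_t b, uint32_t c)
{
    assert(c != 0);
    return a / c * b;
}

/* Multiply first: a * b wraps modulo 2^32 before the division happens,
 * so the quotient is computed from the wrong numerator. */
uint32_t multiply_first(uint32_t a, uint32_t b, uint32_t c)
{
    assert(c != 0);
    return a * b / c;
}
```

With a = 12345678, b = 4235266395 and c = 4235266396 (arbitrary values satisfying a, b < c), the first function returns zero and the second returns at most 1, while the true quotient is 12345677.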
I need to reduce a * b and c proportionally, but again - the problem is that a * b overflows.
Ideally, I would be able to:
Replace a * b with uint(-1)
Replace c with uint(-1) / a / b * c.
But no matter how I order the expression uint(-1) / a / b * c, I encounter a problem:
uint(-1) / a / b * c is truncated to zero because of uint(-1) / a / b
uint(-1) / a * c / b overflows because of uint(-1) / a * c
uint(-1) * c / a / b overflows because of uint(-1) * c
How can I tackle this scenario in order to find a good approximation of a * b / c?
Edit 1
I do not have anything like _umul128 (which applies when the largest integral type is uint64) on my platform. My largest type is uint, and I have no support for anything larger than that (neither on the HW level, nor in some pre-existing standard library).
Edit 2
In response to numerous duplicate suggestions and comments:
I do not have some "larger type" at hand, which I can use for solving this problem. That is why the opening statement of the question is:
Assuming that uint is the largest integral type on my fixed-point platform
I am assuming that no other type exists, neither on the SW layer (via some built-in standard library) nor on the HW layer.
needs to return a good approximation of a * b / c
My largest type is uint
both a and b are smaller than c
Variation on this 32-bit problem:
Algorithm: Scale a, b to not overflow
SQRT_MAX_P1 as a compile time constant of sqrt(uint_MAX + 1)
sh = 0;
if (c >= SQRT_MAX_P1) {
while (|a| >= SQRT_MAX_P1) a/=2, sh++
while (|b| >= SQRT_MAX_P1) b/=2, sh++
while (|c| >= SQRT_MAX_P1) c/=2, sh--
}
result = a*b/c
shift result by sh.
With an n-bit uint, I expect the result to be correct to at least about n/2 significant bits.
Could improve things by taking advantage of the smaller of a,b being less than SQRT_MAX_P1. More on that later if interested.
Example
#include <inttypes.h>
#define IMAX_BITS(m) ((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12))
// https://stackoverflow.com/a/4589384/2410359
#define UINTMAX_WIDTH (IMAX_BITS(UINTMAX_MAX))
#define SQRT_UINTMAX_P1 (((uintmax_t)1ull) << (UINTMAX_WIDTH/2))
uintmax_t muldiv_about(uintmax_t a, uintmax_t b, uintmax_t c) {
int shift = 0;
if (c > SQRT_UINTMAX_P1) {
while (a >= SQRT_UINTMAX_P1) {
a /= 2; shift++;
}
while (b >= SQRT_UINTMAX_P1) {
b /= 2; shift++;
}
while (c >= SQRT_UINTMAX_P1) {
c /= 2; shift--;
}
}
uintmax_t r = a * b / c;
if (shift > 0) r <<= shift;
if (shift < 0) r >>= -shift; // shift count must be non-negative
return r;
}
#include <stdio.h>
int main() {
uintmax_t a = 12345678;
uintmax_t b = 4235266395;
uintmax_t c = 4235266396;
uintmax_t r = muldiv_about(a,b,c);
printf("%ju\n", r);
}
Output with 32-bit math (Precise answer is 12345677)
12345600
Output with 64-bit math
12345677
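On a machine where uintmax_t is 64 bits wide, the 32-bit figure above can be reproduced by pinning the type. The following is a sketch under that assumption: muldiv_about32 and SQRT_U32_P1 are hypothetical names, and the body is the same scaling idea restricted to uint32_t, with the question's precondition (a and b smaller than c) made explicit.

```c
#include <assert.h>
#include <stdint.h>

#define SQRT_U32_P1 ((uint32_t)1 << 16)   /* sqrt(UINT32_MAX + 1) */

uint32_t muldiv_about32(uint32_t a, uint32_t b, uint32_t c)
{
    /* Precondition taken from the question: a and b are both below c. */
    assert(c != 0 && a < c && b < c);

    int shift = 0;
    if (c > SQRT_U32_P1) {
        /* Halve each operand until it fits in 16 bits, tracking the scale. */
        while (a >= SQRT_U32_P1) { a /= 2; shift++; }
        while (b >= SQRT_U32_P1) { b /= 2; shift++; }
        while (c >= SQRT_U32_P1) { c /= 2; shift--; }
    }

    uint32_t r = a * b / c;            /* a * b now fits in 32 bits */
    if (shift > 0) r <<= shift;        /* undo the net scaling */
    if (shift < 0) r >>= -shift;
    return r;
}
```

For the inputs used in main above (12345678, 4235266395, 4235266396), a is halved 8 times and b and c 16 times each, so the result is 48225 shifted left by 8, which is 12345600.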
Here is another approach that uses recursion and minimal approximation to achieve high precision.
First the code and below an explanation.
Code:
uint32_t bp(uint32_t a) {
uint32_t b = 0;
while (a!=0)
{
++b;
a >>= 1;
};
return b;
}
int mul_no_ovf(uint32_t a, uint32_t b)
{
return ((bp(a) + bp(b)) <= 32);
}
uint32_t f(uint32_t a, uint32_t b, uint32_t c)
{
if (mul_no_ovf(a, b))
{
return (a*b) / c;
}
uint32_t m = c / b;
++m;
uint32_t x = m*b - c;
// So m * b == c + x where 0 < x <= b and m >= 2
uint32_t n = a/m;
uint32_t r = a % m;
// So a*b == n * (c + x) + r*b == n*c + n*x + r*b where r*b <= c
// Approximation: get rid of the r*b part
uint32_t res = n;
if (r*b > c/2) ++res;
return res + f(n, x, c);
}
Explanation:
The multiplication a * b can be written as a sum of b
a * b = b + b + .... + b
Since b < c we can take a number m of these b so that (m-1)*b <= c < m*b, like
(b + b + ... + b) + (b + b + ... + b) + .... + b + b + b
\---------------/ \---------------/ + \-------/
m*b + m*b + .... + r*b
\-------------------------------------/
n times m*b
so we have
a*b = n*m*b + r*b
where r*b <= c and m*b > c. Consequently, m*b is equal to c + x, so we have
a*b = n*(c + x) + r*b = n*c + n*x + r*b
Divide by c :
a*b/c = (n*c + n*x + r*b)/c = n + n*x/c + r*b/c
The values m, n, x, r can all be calculated from a, b and c without any loss of
precision using integer division (/) and remainder (%).
The approximation is to look at r*b (which is at most c) and "add zero" when r*b<=c/2
and "add one" when r*b>c/2.
So now there are two possibilities:
1) a*b/c ~= n + n*x/c
2) a*b/c ~= (n + 1) + n*x/c
So the problem (i.e. calculating a*b/c) has been changed to the form
MULDIV(a1,b1,c) = NUMBER + MULDIV(a2,b2,c)
where a2,b2 are no larger than a1,b1 (and a2 is strictly smaller than a1). Consequently, recursion can be used until
a2*b2 no longer overflows (and the calculation can be done directly).
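Because each step only adds a number to a running total and then recurses on smaller arguments, the same steps can also be written as a loop, which may be preferable on targets with a small stack. The sketch below is one such rewrite, not the code above verbatim: the name muldiv_iter is hypothetical, and the overflow test is replaced by an exact check (b > UINT32_MAX / a), which is my own substitution for the bit-count test.

```c
#include <assert.h>
#include <stdint.h>

uint32_t muldiv_iter(uint32_t a, uint32_t b, uint32_t c)
{
    /* Precondition from the question: a and b are both below c. */
    assert(c != 0 && a < c && b < c);

    uint32_t sum = 0;

    /* Loop while a * b would overflow 32 bits (this implies b >= 2). */
    while (a != 0 && b > UINT32_MAX / a) {
        uint32_t m = c / b + 1;   /* smallest m with m * b > c           */
        uint32_t x = m * b - c;   /* may wrap, but the difference is right */
        uint32_t n = a / m;
        uint32_t r = a % m;

        sum += n;                 /* the n * c / c part                   */
        if (r * b > c / 2)        /* round the dropped r * b / c term     */
            ++sum;

        a = n;                    /* continue with n * x / c              */
        b = x;
    }

    return sum + a * b / c;       /* product fits now, finish directly    */
}
```

As a rough check, for a = b = 2^20 and c = 2^21 the exact quotient is 524288, and the loop lands within a few units of that value; every pass drops a term smaller than 1/2, so the error grows with the number of passes.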
I've established a solution which works in O(1) complexity (no loops):
typedef unsigned long long uint;
typedef struct
{
uint n;
uint d;
}
fraction;
uint func(uint a, uint b, uint c);
fraction reducedRatio(uint n, uint d, uint max);
fraction normalizedRatio(uint a, uint b, uint scale);
fraction accurateRatio(uint a, uint b, uint scale);
fraction toFraction(uint n, uint d);
uint roundDiv(uint n, uint d);
uint func(uint a, uint b, uint c)
{
uint hi = a > b ? a : b;
uint lo = a < b ? a : b;
fraction f = reducedRatio(hi, c, (uint)(-1) / lo);
return f.n * lo / f.d;
}
fraction reducedRatio(uint n, uint d, uint max)
{
fraction f = toFraction(n, d);
if (n > max || d > max)
f = normalizedRatio(n, d, max);
if (f.n != f.d)
return f;
return toFraction(1, 1);
}
fraction normalizedRatio(uint a, uint b, uint scale)
{
if (a <= b)
return accurateRatio(a, b, scale);
fraction f = accurateRatio(b, a, scale);
return toFraction(f.d, f.n);
}
fraction accurateRatio(uint a, uint b, uint scale)
{
uint maxVal = (uint)(-1) / scale;
if (a > maxVal)
{
uint c = a / (maxVal + 1) + 1;
a /= c; // we can now safely compute `a * scale`
b /= c;
}
if (a != b)
{
uint n = a * scale;
uint d = a + b; // can overflow
if (d >= a) // no overflow in `a + b`
{
uint x = roundDiv(n, d); // we can now safely compute `scale - x`
uint y = scale - x;
return toFraction(x, y);
}
if (n < b - (b - a) / 2)
{
return toFraction(0, scale); // `a * scale < (a + b) / 2 < (uint)(-1) < a + b`
}
return toFraction(1, scale - 1); // `(a + b) / 2 < a * scale < (uint)(-1) < a + b`
}
return toFraction(scale / 2, scale / 2); // allow reduction to `(1, 1)` in the calling function
}
fraction toFraction(uint n, uint d)
{
fraction f = {n, d};
return f;
}
uint roundDiv(uint n, uint d)
{
return n / d + n % d / (d - d / 2);
}
Here is my test:
#include <stdio.h>
int main()
{
uint a = (uint)(-1) / 3; // 0x5555555555555555
uint b = (uint)(-1) / 2; // 0x7fffffffffffffff
uint c = (uint)(-1) / 1; // 0xffffffffffffffff
printf("0x%llx", func(a, b, c)); // 0x2aaaaaaaaaaaaaaa
return 0;
}
You can cancel common factors as follows:
uint gcd(uint a, uint b)
{
uint c;
while (b)
{
a %= b;
c = a;
a = b;
b = c;
}
return a;
}
uint func(uint a, uint b, uint c)
{
uint temp = gcd(a, c);
a = a/temp;
c = c/temp;
temp = gcd(b, c);
b = b/temp;
c = c/temp;
// The final result fits in the variable, and cancelling shrinks a * b.
// Note that a * b can still overflow when a and b share few factors with c.
return a * b / c;
}
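Cancelling common factors does not guarantee that the reduced product fits, so a caller may want to know when it does not. The following sketch assumes a 32-bit uint and adds an explicit overflow check after the two reductions. The names gcd32 and muldiv_gcd_checked are hypothetical, as is the convention of returning 1 on success and 0 when the reduced product would still overflow.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t gcd32(uint32_t a, uint32_t b)
{
    while (b) {
        uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Stores a * b / c in *out and returns 1 when the reduced product fits
 * in 32 bits; returns 0 (leaving *out untouched) when it does not. */
int muldiv_gcd_checked(uint32_t a, uint32_t b, uint32_t c, uint32_t *out)
{
    assert(c != 0 && out != 0);

    uint32_t g = gcd32(a, c);     /* g >= 1 because c != 0 */
    a /= g;
    c /= g;

    g = gcd32(b, c);
    b /= g;
    c /= g;

    if (a != 0 && b > UINT32_MAX / a)
        return 0;                 /* a * b would still overflow */

    *out = a * b / c;
    return 1;
}
```

For example, 3000000000 * 2000000000 / 4000000000 reduces to 3 * 500000000 / 1 and succeeds, whereas a = b = 2^31 with c = 2^32 - 1 has nothing to cancel (c is odd) and is reported as a failure.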
I'm hoping to optimise an implementation of SHA-1 for an 8-bit MCU (8051-based). The input data is only 8 bytes long, so I wonder if something could be done to improve this macro:
#define S(x,n) ((x << n) | ((x & 0xFFFFFFFF) >> (32 - n)))
The issue I have is that when macro P calls S with S(b, 30), it takes around 60 µs to complete. Since there are 80 calls to P, that totals around 4.8 ms.
If I'm correct, S(x,n) expects x to be a uint32. Given the rather small input size, could the number of shifts be reduced by making x smaller, e.g., uint8?
If so, is this the only change needed? From:
#define S(x,n) ((x << n) | ((x & 0xFFFFFFFF) >> (32 - n)))
To:
#define S(x,n) ((x << n) | ((x & 0xFF) >> (8 - n)))
From:
void sha1_process( sha1_context *ctx, uint8 data[64] )
{
uint32 temp, W[16], A, B, C, D, E;
// ...
To:
void sha1_process( sha1_context *ctx, uint8 data[64] )
{
uint8 temp, W[16], A, B, C, D, E;
// ...
Here's the complete code:
#include <string.h>
#include "sha1.h"
#define GET_UINT32(n,b,i) \
{ \
(n) = ( (uint32) (b)[(i) ] << 24 ) \
| ( (uint32) (b)[(i) + 1] << 16 ) \
| ( (uint32) (b)[(i) + 2] << 8 ) \
| ( (uint32) (b)[(i) + 3] ); \
}
#define PUT_UINT32(n,b,i) \
{ \
(b)[(i) ] = (uint8) ( (n) >> 24 ); \
(b)[(i) + 1] = (uint8) ( (n) >> 16 ); \
(b)[(i) + 2] = (uint8) ( (n) >> 8 ); \
(b)[(i) + 3] = (uint8) ( (n) ); \
}
void sha1_starts( sha1_context *ctx )
{
ctx->total[0] = 0;
ctx->total[1] = 0;
ctx->state[0] = 0x67452301;
ctx->state[1] = 0xEFCDAB89;
ctx->state[2] = 0x98BADCFE;
ctx->state[3] = 0x10325476;
ctx->state[4] = 0xC3D2E1F0;
}
void sha1_process( sha1_context *ctx, uint8 data[64] )
{
uint32 temp, W[16], A, B, C, D, E;
GET_UINT32( W[0], data, 0 );
GET_UINT32( W[1], data, 4 );
GET_UINT32( W[2], data, 8 );
GET_UINT32( W[3], data, 12 );
GET_UINT32( W[4], data, 16 );
GET_UINT32( W[5], data, 20 );
GET_UINT32( W[6], data, 24 );
GET_UINT32( W[7], data, 28 );
GET_UINT32( W[8], data, 32 );
GET_UINT32( W[9], data, 36 );
GET_UINT32( W[10], data, 40 );
GET_UINT32( W[11], data, 44 );
GET_UINT32( W[12], data, 48 );
GET_UINT32( W[13], data, 52 );
GET_UINT32( W[14], data, 56 );
GET_UINT32( W[15], data, 60 );
#define S(x,n) ((x << n) | ((x & 0xFFFFFFFF) >> (32 - n)))
#define R(t) \
( \
temp = W[(t - 3) & 0x0F] ^ W[(t - 8) & 0x0F] ^ \
W[(t - 14) & 0x0F] ^ W[ t & 0x0F], \
( W[t & 0x0F] = S(temp,1) ) \
)
#define P(a,b,c,d,e,x) \
{ \
e += S(a,5) + F(b,c,d) + K + x; b = S(b,30); \
}
A = ctx->state[0];
B = ctx->state[1];
C = ctx->state[2];
D = ctx->state[3];
E = ctx->state[4];
#define F(x,y,z) (z ^ (x & (y ^ z)))
#define K 0x5A827999
P( A, B, C, D, E, W[0] );
P( E, A, B, C, D, W[1] );
P( D, E, A, B, C, W[2] );
P( C, D, E, A, B, W[3] );
P( B, C, D, E, A, W[4] );
P( A, B, C, D, E, W[5] );
P( E, A, B, C, D, W[6] );
P( D, E, A, B, C, W[7] );
P( C, D, E, A, B, W[8] );
P( B, C, D, E, A, W[9] );
P( A, B, C, D, E, W[10] );
P( E, A, B, C, D, W[11] );
P( D, E, A, B, C, W[12] );
P( C, D, E, A, B, W[13] );
P( B, C, D, E, A, W[14] );
P( A, B, C, D, E, W[15] );
P( E, A, B, C, D, R(16) );
P( D, E, A, B, C, R(17) );
P( C, D, E, A, B, R(18) );
P( B, C, D, E, A, R(19) );
#undef K
#undef F
#define F(x,y,z) (x ^ y ^ z)
#define K 0x6ED9EBA1
P( A, B, C, D, E, R(20) );
P( E, A, B, C, D, R(21) );
P( D, E, A, B, C, R(22) );
P( C, D, E, A, B, R(23) );
P( B, C, D, E, A, R(24) );
P( A, B, C, D, E, R(25) );
P( E, A, B, C, D, R(26) );
P( D, E, A, B, C, R(27) );
P( C, D, E, A, B, R(28) );
P( B, C, D, E, A, R(29) );
P( A, B, C, D, E, R(30) );
P( E, A, B, C, D, R(31) );
P( D, E, A, B, C, R(32) );
P( C, D, E, A, B, R(33) );
P( B, C, D, E, A, R(34) );
P( A, B, C, D, E, R(35) );
P( E, A, B, C, D, R(36) );
P( D, E, A, B, C, R(37) );
P( C, D, E, A, B, R(38) );
P( B, C, D, E, A, R(39) );
#undef K
#undef F
#define F(x,y,z) ((x & y) | (z & (x | y)))
#define K 0x8F1BBCDC
P( A, B, C, D, E, R(40) );
P( E, A, B, C, D, R(41) );
P( D, E, A, B, C, R(42) );
P( C, D, E, A, B, R(43) );
P( B, C, D, E, A, R(44) );
P( A, B, C, D, E, R(45) );
P( E, A, B, C, D, R(46) );
P( D, E, A, B, C, R(47) );
P( C, D, E, A, B, R(48) );
P( B, C, D, E, A, R(49) );
P( A, B, C, D, E, R(50) );
P( E, A, B, C, D, R(51) );
P( D, E, A, B, C, R(52) );
P( C, D, E, A, B, R(53) );
P( B, C, D, E, A, R(54) );
P( A, B, C, D, E, R(55) );
P( E, A, B, C, D, R(56) );
P( D, E, A, B, C, R(57) );
P( C, D, E, A, B, R(58) );
P( B, C, D, E, A, R(59) );
#undef K
#undef F
#define F(x,y,z) (x ^ y ^ z)
#define K 0xCA62C1D6
P( A, B, C, D, E, R(60) );
P( E, A, B, C, D, R(61) );
P( D, E, A, B, C, R(62) );
P( C, D, E, A, B, R(63) );
P( B, C, D, E, A, R(64) );
P( A, B, C, D, E, R(65) );
P( E, A, B, C, D, R(66) );
P( D, E, A, B, C, R(67) );
P( C, D, E, A, B, R(68) );
P( B, C, D, E, A, R(69) );
P( A, B, C, D, E, R(70) );
P( E, A, B, C, D, R(71) );
P( D, E, A, B, C, R(72) );
P( C, D, E, A, B, R(73) );
P( B, C, D, E, A, R(74) );
P( A, B, C, D, E, R(75) );
P( E, A, B, C, D, R(76) );
P( D, E, A, B, C, R(77) );
P( C, D, E, A, B, R(78) );
P( B, C, D, E, A, R(79) );
#undef K
#undef F
ctx->state[0] += A;
ctx->state[1] += B;
ctx->state[2] += C;
ctx->state[3] += D;
ctx->state[4] += E;
}
void sha1_update( sha1_context *ctx, uint8 *input, uint32 length )
{
uint32 left, fill;
if( ! length ) return;
left = ctx->total[0] & 0x3F;
fill = 64 - left;
ctx->total[0] += length;
ctx->total[0] &= 0xFFFFFFFF;
if( ctx->total[0] < length )
ctx->total[1]++;
if( left && length >= fill )
{
memcpy( (void *) (ctx->buffer + left),
(void *) input, fill );
sha1_process( ctx, ctx->buffer );
length -= fill;
input += fill;
left = 0;
}
while( length >= 64 )
{
sha1_process( ctx, input );
length -= 64;
input += 64;
}
if( length )
{
memcpy( (void *) (ctx->buffer + left),
(void *) input, length );
}
}
static uint8 sha1_padding[64] =
{
0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
void sha1_finish( sha1_context *ctx, uint8 digest[20] )
{
uint32 last, padn;
uint32 high, low;
uint8 msglen[8];
high = ( ctx->total[0] >> 29 )
| ( ctx->total[1] << 3 );
low = ( ctx->total[0] << 3 );
PUT_UINT32( high, msglen, 0 );
PUT_UINT32( low, msglen, 4 );
last = ctx->total[0] & 0x3F;
padn = ( last < 56 ) ? ( 56 - last ) : ( 120 - last );
sha1_update( ctx, sha1_padding, padn );
sha1_update( ctx, msglen, 8 );
PUT_UINT32( ctx->state[0], digest, 0 );
PUT_UINT32( ctx->state[1], digest, 4 );
PUT_UINT32( ctx->state[2], digest, 8 );
PUT_UINT32( ctx->state[3], digest, 12 );
PUT_UINT32( ctx->state[4], digest, 16 );
}
#ifdef TEST
#include <stdlib.h>
#include <stdio.h>
/*
* those are the standard FIPS-180-1 test vectors
*/
static char *msg[] =
{
"abc",
"abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq",
NULL
};
static char *val[] =
{
"a9993e364706816aba3e25717850c26c9cd0d89d",
"84983e441c3bd26ebaae4aa1f95129e5e54670f1",
"34aa973cd4c4daa4f61eeb2bdbad27316534016f"
};
int main( int argc, char *argv[] )
{
FILE *f;
int i, j;
char output[41];
sha1_context ctx;
unsigned char buf[1000];
unsigned char sha1sum[20];
if( argc < 2 )
{
printf( "\n SHA-1 Validation Tests:\n\n" );
for( i = 0; i < 3; i++ )
{
printf( " Test %d ", i + 1 );
sha1_starts( &ctx );
if( i < 2 )
{
sha1_update( &ctx, (uint8 *) msg[i],
strlen( msg[i] ) );
}
else
{
memset( buf, 'a', 1000 );
for( j = 0; j < 1000; j++ )
{
sha1_update( &ctx, (uint8 *) buf, 1000 );
}
}
sha1_finish( &ctx, sha1sum );
for( j = 0; j < 20; j++ )
{
sprintf( output + j * 2, "%02x", sha1sum[j] );
}
if( memcmp( output, val[i], 40 ) )
{
printf( "failed!\n" );
return( 1 );
}
printf( "passed.\n" );
}
printf( "\n" );
}
else
{
if( ! ( f = fopen( argv[1], "rb" ) ) )
{
perror( "fopen" );
return( 1 );
}
sha1_starts( &ctx );
while( ( i = fread( buf, 1, sizeof( buf ), f ) ) > 0 )
{
sha1_update( &ctx, buf, i );
}
sha1_finish( &ctx, sha1sum );
for( j = 0; j < 20; j++ )
{
printf( "%02x", sha1sum[j] );
}
printf( " %s\n", argv[1] );
}
return( 0 );
}
#endif
Here's an example of the generated code for S(x,n) when called by P( E, A, B, C, D, W[1] ):
0031D0 85 18 82 MOV DPL,XSP(L)
0031D3 85 19 83 MOV DPH,XSP(H)
0031D6 78 08 MOV R0,#0x08
0031D8 12 17 85 LCALL ?L_MOV_X
0031DB 74 1E MOV A,#0x1E
0031DD 78 08 MOV R0,#0x08
0031DF 12 16 80 LCALL ?L_SHL
0031E2 85 18 82 MOV DPL,XSP(L)
0031E5 85 19 83 MOV DPH,XSP(H)
0031E8 78 10 MOV R0,#0x10
0031EA 12 17 85 LCALL ?L_MOV_X
0031ED 74 02 MOV A,#0x02
0031EF 78 10 MOV R0,#0x10
0031F1 12 16 67 LCALL ?UL_SHR
0031F4 78 08 MOV R0,#0x08
0031F6 79 10 MOV R1,#0x10
0031F8 12 17 39 LCALL ?L_IOR
0031FB 85 18 82 MOV DPL,XSP(L)
0031FE 85 19 83 MOV DPH,XSP(H)
003201 78 08 MOV R0,#0x08
003203 12 17 94 LCALL ?L_MOV_TO_X
Thanks
If I'm correct, S(x,n) expects x to be a uint32. Given the rather small input size, could the number of shifts be reduced by making x smaller, e.g., uint8?
No. The state of the SHA1 function consists of five 32-bit values which change every iteration, and those values are what S(x,n) is operating on. Changing those into 8-bit values would give you a completely different (and probably very broken!) hash function.
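A small sketch can show why the narrowed macro is not a drop-in replacement. It assumes the usual fixed-width types from stdint.h, and rotl32 and rotl8 are hypothetical helper names that mirror the 32-bit macro and the proposed 8-bit one. Note also that the 8-bit form cannot even express S(b, 30), since 8 - 30 is a negative shift count.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit left rotate, equivalent to S(x, n) for n in 1..31. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* 8-bit left rotate, as in the proposed macro; only valid for n in 1..7. */
uint8_t rotl8(uint8_t x, unsigned n)
{
    assert(n > 0 && n < 8);
    return (uint8_t)((x << n) | (x >> (8 - n)));
}
```

Rotating 0x80000001 left by one bit gives 0x00000003, because the top bit wraps around into bit 0. Rotating only its low byte 0x01 gives 0x02: the bit that should have arrived from the upper bytes is lost, so the two computations diverge from the first round onward.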
The MD5/SHA family of hash functions all rely heavily on 32-bit integer operations. Ease of implementation on 8-bit processors, like the 8051, was not a design goal for these functions, and implementations on these parts will not perform particularly well. Sorry. You'll need to either live with the slowness, use another microprocessor (or one with SHA1 hardware acceleration!), or use a different hash algorithm.
It sounds like your actual requirement is finding a MAC/PRF that's cheap to compute on your hardware for 8 byte inputs.
Since your data has fixed length, you can use a secure block cipher (with 128 bit blocks) as CBC-MAC. Since your data is shorter than one block, CBC-MAC simplifies to encrypting the data with the raw block cipher/ECB mode.
If your 128 bit block cipher has a similar cost-per-byte as SHA-1, this will result in an 8x speedup compared with HMAC-SHA-1 (SHA-1 has 512 bit blocks and you need to hash two blocks for HMAC). If you choose a cipher that's particularly suited for your CPU, the speedup might be even larger.
Since AES is so popular, finding implementations optimized for 8 bit CPUs shouldn't be too hard.