Divide by power of 2 resulting in float - c

I find myself needing to compute a 16-bit unsigned integer divided by a power of 2, which should result in a 32-bit float (standard IEEE format). This is on an embedded system and the routine is used repeatedly, so I am looking for something better than (float)x/(float)(1<<n). In addition, the C compiler is pretty limited (no math lib, bit fields, reinterpret_cast, etc).

If you don't mind some bit twiddling then the obvious way to go is to convert the integer to float and then subtract n from the exponent bits to achieve the division by 2^n:
y = (float)x; // convert to float
uint32_t yi = *(uint32_t *)&y; // get float value as bits
uint32_t exponent = yi & 0x7f800000; // extract exponent bits 30..23
exponent -= (n << 23); // subtract n from exponent
yi = yi & ~0x7f800000 | exponent; // insert modified exponent back into bits 30..23
y = *(float *)&yi; // copy bits back to float
Note that this fails for x = 0, so you should check x > 0 before conversion.
Total cost is one int-float conversion plus a handful of integer bitwise/arithmetic operations. If you use a union you can avoid having separate int/float representations and just work directly on the float.
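A minimal sketch of the union variant, with the zero check folded in. The helper name u16_div_pow2 and the limit n <= 126 are assumptions made here for illustration; the sketch also assumes that float is IEEE-754 binary32, that it shares its byte order with uint32_t, and that <stdint.h> is available on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns x / 2^n as an IEEE-754 binary32 value. */
float u16_div_pow2(uint16_t x, unsigned n)
{
    union { float f; uint32_t u; } v;

    assert(n <= 126);          /* keeps the result a normal float */
    if (x == 0)
        return 0.0f;           /* zero has no exponent to adjust */
    v.f = (float)x;            /* exact: 16 bits fit in the 24-bit significand */
    v.u -= (uint32_t)n << 23;  /* subtract n from the biased exponent field */
    return v.f;
}
```

Because x >= 1 gives a biased exponent of at least 127, subtracting n <= 126 never borrows into the sign bit, so the subtraction can be done on the whole word.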

Use ldexpf(x, -n). This function is defined by the C standard to do exactly what you want, return x·2^-n, so any decent compiler will provide good code for this. (This requires either part of a math library or a compiler that optimizes this to inline code.)
If n is known at compile time, you can also consider x * (1.f/(1<<n)). A good compiler will compute (1.f/(1<<n)) at compile time, so the executable code will be two operations: Convert x to float and multiply by a constant. That might be faster than the code generated for ldexpf(x, -n) if the compiler does not optimize ldexpf(x, -n) as well as it might.

A quick and easy solution is to precompute a table of float values of 2^-n for n >= 0 (what's the upper limit for n, around 31?) and then multiply x by the nth element of the table.
This may not be the fastest if your code emulates floating point multiplication because the CPU doesn't support it directly.
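A minimal sketch of the table approach. The table size (n from 0 to 16) and the names inv_pow2 and u16_div_pow2_table are assumptions chosen for illustration; extend the table if larger shifts are needed.

```c
#include <assert.h>
#include <stdint.h>

/* 2^-n for n = 0..16; every entry is exactly representable in binary32. */
static const float inv_pow2[17] = {
    1.0f, 0.5f, 0.25f, 0.125f, 0.0625f, 0.03125f, 0.015625f, 0.0078125f,
    0.00390625f, 0.001953125f, 0.0009765625f, 0.00048828125f,
    0.000244140625f, 0.0001220703125f, 0.00006103515625f,
    0.000030517578125f, 0.0000152587890625f
};

/* Hypothetical helper: x / 2^n via one conversion and one multiply. */
float u16_div_pow2_table(uint16_t x, unsigned n)
{
    assert(n <= 16);               /* stay inside the table */
    return (float)x * inv_pow2[n]; /* exact, since both factors are exact */
}
```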
You may, however, do it quicker using integer math only.
Example (assuming IEEE-754 32-bit floats):
#include <limits.h>
#include <string.h>
#include <stdio.h>
#define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]
C_ASSERT(CHAR_BIT == 8);
C_ASSERT(sizeof(float) == 4);
C_ASSERT(sizeof(int) == 4);
float div(int x, unsigned n)
{
float res;
unsigned e = 0;
unsigned sign = x < 0;
unsigned m = sign ? -x : x;
if (m)
{
while (m >= (1u << 24))
m >>= 1, e++;
while (m < (1u << 23))
m <<= 1, e--;
e += 0x7F + 23;
e -= n; // divide by 1<<n
m ^= 1u << 23; // reset the implicit 1
m |= (e & 0xFF) << 23; // mix in the exponent
m |= sign << 31; // mix in the sign
}
memcpy(&res, &m, sizeof m);
return res;
}
void Print4Bytes(unsigned char buf[4])
{
printf("%02X%02X%02X%02X ", buf[3], buf[2], buf[1], buf[0]);
}
int main(void)
{
int x = 0x35AA53;
int n;
for (n = 0; n < 31; n++)
{
float v1 = (float)x/(1u << n);
float v2 = div(x, n);
Print4Bytes((void*)&v1);
printf("%c= ", "!="[memcmp(&v1, &v2, sizeof v1) == 0]);
Print4Bytes((void*)&v2);
printf("%14.6f %14.6f\n", v1, v2);
}
return 0;
}
Output (ideone):
4A56A94C == 4A56A94C 3517011.000000 3517011.000000
49D6A94C == 49D6A94C 1758505.500000 1758505.500000
4956A94C == 4956A94C 879252.750000 879252.750000
48D6A94C == 48D6A94C 439626.375000 439626.375000
4856A94C == 4856A94C 219813.187500 219813.187500
47D6A94C == 47D6A94C 109906.593750 109906.593750
4756A94C == 4756A94C 54953.296875 54953.296875
46D6A94C == 46D6A94C 27476.648438 27476.648438
4656A94C == 4656A94C 13738.324219 13738.324219
45D6A94C == 45D6A94C 6869.162109 6869.162109
4556A94C == 4556A94C 3434.581055 3434.581055
44D6A94C == 44D6A94C 1717.290527 1717.290527
4456A94C == 4456A94C 858.645264 858.645264
43D6A94C == 43D6A94C 429.322632 429.322632
4356A94C == 4356A94C 214.661316 214.661316
42D6A94C == 42D6A94C 107.330658 107.330658
4256A94C == 4256A94C 53.665329 53.665329
41D6A94C == 41D6A94C 26.832664 26.832664
4156A94C == 4156A94C 13.416332 13.416332
40D6A94C == 40D6A94C 6.708166 6.708166
4056A94C == 4056A94C 3.354083 3.354083
3FD6A94C == 3FD6A94C 1.677042 1.677042
3F56A94C == 3F56A94C 0.838521 0.838521
3ED6A94C == 3ED6A94C 0.419260 0.419260
3E56A94C == 3E56A94C 0.209630 0.209630
3DD6A94C == 3DD6A94C 0.104815 0.104815
3D56A94C == 3D56A94C 0.052408 0.052408
3CD6A94C == 3CD6A94C 0.026204 0.026204
3C56A94C == 3C56A94C 0.013102 0.013102
3BD6A94C == 3BD6A94C 0.006551 0.006551
3B56A94C == 3B56A94C 0.003275 0.003275

Related

Multiply float by a number using bitwise operators

I have this function that takes in the bits of a float (f) as a uint32_t. It should use bit operations and + to calculate f * 2048 and should return the bits of this value as a uint32_t.
If the result is too large to be represented as a float, +inf or -inf should be returned; and if f is +0, -0, +inf, -inf, or NaN, it should be returned unchanged.
uint32_t float_2048(uint32_t f) {
uint32_t a = (f << 1) ;
int result = a << 10;
return result;
}
This is what I have so far but if I give it the value '1' it returns 0 instead of 2048. How do I fix this?
Some example inputs and outputs:
./float_2048 1
2048
./float_2048 3.14159265
6433.98193
./float_2048 -2.718281828e-20
-5.56704133e-17
./float_2048 1e38
inf
As mentioned in the comments, to multiply a floating-point number by a power of 2 (assuming, as is likely, that it is represented in IEEE-754 format), we can just add that power to the (binary) exponent part of the representation.
For a single-precision (32-bit) float value, that exponent is stored in bits 30-23 and the following code shows how to extract those, add the required value (11, because 2048 = 2^11), then replace the exponent bits with that modified value.
uint32_t fmul2048(uint32_t f)
{
#define EXPONENT 0x7F800000u
#define SIGN_BIT 0x80000000u
uint32_t expon = (f & EXPONENT) >> 23; // Get exponent value
f &= ~EXPONENT; // Remove old exponent
expon += 11; // Adding 11 to exponent multiplies by 2^11 (= 2048);
if (expon > 254) return EXPONENT | (f & SIGN_BIT); // Too big: return +/- Inf
f |= (expon << 23); // Insert modified exponent
return f;
}
There will, no-doubt, be some "bit trickery" that can be applied to make the code smaller and/or more efficient; but I have avoided doing so in order to keep the code clear. I have also included one error check (for a too large exponent) and the code returns the standard representation for +/- Infinity (all exponent bits set to 1, and keeping the original sign) if that test fails. (I leave other error-checking as an "exercise for the reader".)
Handling every float value takes more code.
Do some tests so the code can assume the expected float size, matching endianness and (IEEE) encoding. C does not require float to be 32 bits wide, to match the endianness of an integer, or to use binary32 encoding, even though all of that is common.
Extract the biased exponent and look for its min and max value.
Max values signify NAN or infinity.
Min values are sub-normals and zero and need special handling. The significand needs to be shifted. If that result is now a normal float, re-encode it.
Biased exponents in between simply need an increment and a test for exceeding FLT_MAX's exponent.
Tested successfully for all float.
#include <assert.h>
#include <stdint.h>
static_assert(sizeof(uint32_t) == sizeof(float), "Unexpected float size");
#define IEEE_MASK_BIASED_EXPO 0x7F800000u
#define IEEE_MASK_BIASED_EXPO_LSB 0x00800000u
#define IEEE_MASK_SIGNIFICAND 0x007FFFFFu
#define IEEE_SIGNIFICAND_MAX 0x00FFFFFFu
#define IEEE_INFINITY 0x7F800000u
// Scale value by 2048
uint32_t float_2048(uint32_t f) {
uint32_t expo = f & IEEE_MASK_BIASED_EXPO;
// Test for infinity or NAN
if (expo == IEEE_MASK_BIASED_EXPO) {
return f;
}
// Sub-normal and zero test
if (expo == 0) {
uint64_t sig = f & IEEE_MASK_SIGNIFICAND;
sig <<= 11; // *= 2048;
// If value now a normal one
if (sig > IEEE_MASK_SIGNIFICAND) {
expo += IEEE_MASK_BIASED_EXPO_LSB;
while (sig > IEEE_SIGNIFICAND_MAX) {
sig >>= 1;
expo += IEEE_MASK_BIASED_EXPO_LSB;
}
f = (f & ~IEEE_MASK_BIASED_EXPO) | (expo & IEEE_MASK_BIASED_EXPO);
}
f = (f & ~IEEE_MASK_SIGNIFICAND) | (sig & IEEE_MASK_SIGNIFICAND);
} else {
expo += 11 * IEEE_MASK_BIASED_EXPO_LSB; // *= 2048;
if (expo >= IEEE_MASK_BIASED_EXPO) {
f &= ~(IEEE_MASK_BIASED_EXPO | IEEE_MASK_SIGNIFICAND);
f |= IEEE_INFINITY;
} else {
f = (f & ~IEEE_MASK_BIASED_EXPO) | (expo & IEEE_MASK_BIASED_EXPO);
}
}
return f;
}
Test code.
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef union {
uint32_t u32;
float f;
} fu32;
int main(void ) {
// Lightweight test to see if endian matches and IEEE encoding
assert((fu32) {.u32 = 0x87654321}.f == -1.72477726182e-34f);
float f[] = {0, FLT_TRUE_MIN, FLT_MIN, 1, FLT_MAX};
size_t n = sizeof f/sizeof f[0];
for (size_t i = 0; i<n; i++) {
fu32 x = { .f = f[i] };
float y0 = x.f * 2048.0f;
fu32 y1 = { .u32 = float_2048(x.u32) };
if (memcmp(&y0, &y1.f, sizeof y0)) {
printf("%.9g %.9g\n", y0, y1.f);
}
}
fu32 x = { .u32 = 0 };
do {
fu32 y0 = { .f = isnan(x.f) ? x.f : x.f * 2048.0f };
fu32 y1 = { .u32 = float_2048(x.u32) };
if (memcmp(&y0.f, &y1.f, sizeof y0)) {
printf("%.9g %.9g\n", y0.f, y1.f);
printf("%08lx %08lx %08lx\n", (unsigned long) x.u32,
(unsigned long) y0.u32, (unsigned long) y1.u32);
break;
}
x.u32++;
} while (x.u32 != 0);
puts("Done");
}

How to correctly implement multiply for floating point numbers (software FP)

My program is about a method which is given floats and in this method I want to multiply or add those floats. But not multiply like a * b, I want to break those floats down to their structure like the bit for the sign, the 8 bit for the exponent and the rest of the bits as the mantissa.
I want to implement / emulate software floating-point add and multiply (to learn more about what FP hardware has to do).
In the head of the program there are the breakdowns:
#define SIGN(x) (x>>31);
#define MANT(x) (x&0x7FFFFF);
#define EXPO(x) ((x>>23)&0xFF);
#define SPLIT(x, s, m, e) do { \
s = SIGN(x); \
m = MANT(x); \
e = EXPO(x); \
if ( e != 0x00 && e != 0xFF ) { \
m |= 0x800000; \
} \
} while ( 0 )
#define BUILD(x, s, m, e) do { \
x = (s << 31) | (e<<23) | (m&0x7FFFFF); \
} while ( 0 )
The main looks as follows:
float f = 2.3;
float g = 1.8;
float h = foo(&f, &g);
And the method for the calculation looks like:
float foo(float *a, float *b) {
uint32_t ia = *(unsigned int *)a;
uint32_t ib = *(unsigned int *)b;
uint32_t result = 0;
uint32_t signa, signb, signr;
uint32_t manta, mantb, mantr;
uint32_t expoa, expob, expor;
SPLIT(ia, signa, manta, expoa);
SPLIT(ib, signb, mantb, expob);
I already tried the multiply by adding the exponents and multiplying their mantissas as follows:
expor = (expoa -127) + (expob -127) + 127;
mantr = (manta) * (mantb);
signr = signa ^ signb;
The return and rebuild of the new float:
BUILD(result, signr, mantr, expor);
return *(float *)&result;
The problem now is that the result is wrong. mantr even takes a very low negative number (if foo gets 1.5 and 2.4, mantr takes -838860800 and the result is 2.0000000).
You can't just truncate the result of the mantissa multiply, you need to take the top 24 bits (after using the low half for rounding) and renormalize (adjust the exponent).
Floating point operations keep the top significand bits. The most significant part of the integer product is the high bits; the low bits are further places after the decimal. (Terminology: it's a "binary point", not "decimal point", because binary floats use radix 2 (binary), not 10 (decimal).)
For normalized inputs, the implicit leading 1 in the input significands means the 32x32 => 64-bit uint64_t product that you use to implement 24 x 24 => 48-bit mantissa multiplication will have its high bit in one of 2 possible locations, so you don't need a bit-scan to find it. A compare or single-bit-test will do.
For subnormal inputs, that's not guaranteed so you need to check where the MSB is, e.g. with GNU C __builtin_clzll. (There are many special cases to handle for one or both inputs being subnormal, and/or the output being subnormal.)
See https://en.wikipedia.org/wiki/Single-precision_floating-point_format for more about the IEEE-754 binary32 format, including the implied leading 1 of the significand.
And see @njuffa's answer for an actual tested + working implementation that does 64-bit operations as two 32-bit halves for some reason, instead of letting C do that efficiently.
Also, return *(float *)&result; violates strict aliasing. It's only safe on MSVC. Use a union or memcpy for type punning in C99 / C11.
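A minimal sketch of the points above, restricted to the easy case: both inputs and the result are normal numbers. The names bits_of, float_of and mul_normals are made up for this illustration, and zeros, subnormals, infinities, NaNs, overflow and underflow are deliberately not handled. It uses memcpy for the type punning, a uint64_t product, a single bit test to locate the leading 1, and the discarded low bits for round-to-nearest-even.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t bits_of(float f)  { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float float_of(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Hypothetical sketch: inputs and result must all be normal numbers. */
float mul_normals(float a, float b)
{
    uint32_t ia = bits_of(a), ib = bits_of(b);
    uint32_t sign = (ia ^ ib) & 0x80000000u;
    int32_t ea = (int32_t)((ia >> 23) & 0xFF);
    int32_t eb = (int32_t)((ib >> 23) & 0xFF);
    uint64_t ma = (ia & 0x7FFFFFu) | 0x800000u;   /* restore implicit 1 */
    uint64_t mb = (ib & 0x7FFFFFu) | 0x800000u;
    uint64_t prod;
    uint32_t mant, rest, half;
    int32_t e = ea + eb - 127;                    /* remove one bias */
    int shift = 23;

    assert(ea >= 1 && ea <= 254 && eb >= 1 && eb <= 254);
    prod = ma * mb;                               /* 48 bits, MSB at bit 46 or 47 */
    if (prod & (1ULL << 47)) { shift = 24; e++; } /* product is in [2, 4) */
    mant = (uint32_t)(prod >> shift);             /* top 24 bits */
    rest = (uint32_t)(prod & ((1ULL << shift) - 1)); /* discarded bits */
    half = 1u << (shift - 1);
    if (rest > half || (rest == half && (mant & 1)))
        mant++;                                   /* nearest, ties to even */
    if (mant == 0x1000000u) { mant >>= 1; e++; }  /* rounding carried out */
    assert(e >= 1 && e <= 254);                   /* result must stay normal */
    return float_of(sign | ((uint32_t)e << 23) | (mant & 0x7FFFFFu));
}
```

For the inputs from the question, mul_normals(1.5f, 2.4f) should agree bit for bit with 1.5f * 2.4f on hardware that evaluates float multiplication in single precision.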
Emulating the multiplication of two IEEE-754 (2008) binary32 operands is a bit more complex than the question suggests. In general, we have to distinguish the following operand classes: zeros, subnormals (0 < |x| < 2^-126), normals (2^-126 ≤ |x| < 2^128), infinities, NaNs. Normals use biased exponents in [1, 254], while any of the special operand classes use biased exponents in {0, 255}. The following assumes we want to implement floating-point multiply with all floating-point exceptions masked, and using the round-to-nearest-even rounding mode.
First, we check whether any of the arguments belongs to a special operand class. If so, we check the special cases in sequence. If one of the arguments is a NaN, we turn that NaN into a QNaN and return it. If one of the operands is zero, we return an appropriately signed zero, unless the other argument is an infinity, in which case we return a special QNaN INDEFINITE since this is an invalid operation. After that we check for any argument of infinity, returning an appropriately signed infinity. This leaves subnormals, which we normalize. In case there are two subnormal arguments, we only need to normalize one of them as the result will underflow to zero.
The multiplication of normals proceeds as the asker envisioned in the question. The sign of the result is the exclusive-OR of the signs of the arguments, the exponent of the result is the sum of the exponents of the arguments (adjusted for exponent bias), and the significand of the result is generated from the product of the significands of the arguments. We need the full product for rounding. We can either use a 64-bit type for that, or represent it with a pair of 32-bit numbers. In the code below I have chosen the latter representation. Rounding to nearest-or-even is straightforward: if we have a tie case (the result is exactly in the middle between the closest two binary32 numbers), we need to round up if the least significant bit of the mantissa is 1. Otherwise, we need to round up if the most significant discarded bit (the round bit) is 1.
Three cases need to be considered for the result, based on the result exponent prior to rounding: Exponent is in normal range, result overflows (too large in magnitude), or it underflows (too small in magnitude). In the first case, the result is a normal or infinity if overflow occurs during rounding. In the second case, the result is infinity. In the last case the result is either zero (severe underflow), a subnormal, or the smallest normal (if round-up occurs).
The following code, with a simple framework for light testing via gobs of random test cases and several thousand interesting patterns shows an exemplary ISO-C implementation written in a couple of hours for reasonable clarity and reasonable performance. I let the test framework run for an hour or so on an x64 platform and no errors were reported. If you plan to use the code in production, you would want to construct a more stringent test framework, and may need additional performance tuning.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>
#define FLOAT_MANT_BITS (23)
#define FLOAT_EXPO_BITS (8)
#define FLOAT_EXPO_BIAS (127)
#define FLOAT_MANT_MASK (~((~0u) << (FLOAT_MANT_BITS+1))) /* incl. integer bit */
#define EXPO_ADJUST (1) /* adjustment for performance reasons */
#define MIN_NORM_EXPO (1) /* minimum biased exponent of normals */
#define MAX_NORM_EXPO (254) /* maximum biased exponent of normals */
#define INF_EXPO (255) /* biased exponent of infinities */
#define EXPO_MASK (~((~0u) << FLOAT_EXPO_BITS))
#define FLOAT_SIGN_MASK (0x80000000u)
#define FLOAT_IMPLICIT_BIT (1 << FLOAT_MANT_BITS)
#define RND_BIT_SHIFT (31)
#define RND_BIT_MASK (1u << RND_BIT_SHIFT)
#define FLOAT_INFINITY (0x7f800000)
#define FLOAT_INDEFINITE (0xffc00000u)
#define MANT_LSB (0x00000001)
#define FLOAT_QNAN_BIT (0x00400000)
#define MAX_SHIFT (FLOAT_MANT_BITS + 2)
uint32_t fp32_mul_core (uint32_t a, uint32_t b)
{
uint64_t prod;
uint32_t expoa, expob, manta, mantb, shift;
uint32_t r, signr, expor, mantr_hi, mantr_lo;
/* split arguments into sign, exponent, significand */
expoa = ((a >> FLOAT_MANT_BITS) & EXPO_MASK) - EXPO_ADJUST;
expob = ((b >> FLOAT_MANT_BITS) & EXPO_MASK) - EXPO_ADJUST;
manta = (a | FLOAT_IMPLICIT_BIT) & FLOAT_MANT_MASK;
mantb = (b | FLOAT_IMPLICIT_BIT) & FLOAT_MANT_MASK;
/* result sign bit: XOR sign argument signs */
signr = (a ^ b) & FLOAT_SIGN_MASK;
if ((expoa >= (MAX_NORM_EXPO - EXPO_ADJUST)) || /* at least one argument is special */
(expob >= (MAX_NORM_EXPO - EXPO_ADJUST))) {
if ((a & ~FLOAT_SIGN_MASK) > FLOAT_INFINITY) { /* a is NaN */
/* return quietened NaN */
return a | FLOAT_QNAN_BIT;
}
if ((b & ~FLOAT_SIGN_MASK) > FLOAT_INFINITY) { /* b is NaN */
/* return quietened NaN */
return b | FLOAT_QNAN_BIT;
}
if ((a & ~FLOAT_SIGN_MASK) == 0) { /* a is zero */
/* return NaN if b is infinity, else zero */
return (expob != (INF_EXPO - EXPO_ADJUST)) ? signr : FLOAT_INDEFINITE;
}
if ((b & ~FLOAT_SIGN_MASK) == 0) { /* b is zero */
/* return NaN if a is infinity, else zero */
return (expoa != (INF_EXPO - EXPO_ADJUST)) ? signr : FLOAT_INDEFINITE;
}
if (((a & ~FLOAT_SIGN_MASK) == FLOAT_INFINITY) || /* a or b infinity */
((b & ~FLOAT_SIGN_MASK) == FLOAT_INFINITY)) {
return signr | FLOAT_INFINITY;
}
if ((int32_t)expoa < (MIN_NORM_EXPO - EXPO_ADJUST)) { /* a is subnormal */
/* normalize significand of a */
manta = a & FLOAT_MANT_MASK;
expoa++;
do {
manta = 2 * manta;
expoa--;
} while (manta < FLOAT_IMPLICIT_BIT);
} else if ((int32_t)expob < (MIN_NORM_EXPO - EXPO_ADJUST)) { /* b is subnormal */
/* normalize significand of b */
mantb = b & FLOAT_MANT_MASK;
expob++;
do {
mantb = 2 * mantb;
expob--;
} while (mantb < FLOAT_IMPLICIT_BIT);
}
}
/* result exponent: add argument exponents and adjust for biasing */
expor = expoa + expob - FLOAT_EXPO_BIAS + 2 * EXPO_ADJUST;
mantb = mantb << FLOAT_EXPO_BITS; /* preshift to align result significand */
/* result significand: multiply argument significands */
prod = (uint64_t)manta * mantb;
mantr_hi = (uint32_t)(prod >> 32);
mantr_lo = (uint32_t)(prod >> 0);
/* normalize significand */
if (mantr_hi < FLOAT_IMPLICIT_BIT) {
mantr_hi = (mantr_hi << 1) | (mantr_lo >> (32 - 1));
mantr_lo = (mantr_lo << 1);
expor--;
}
if (expor <= (MAX_NORM_EXPO - EXPO_ADJUST)) { /* normal, may overflow to infinity during rounding */
/* combine biased exponent, sign and significand */
r = (expor << FLOAT_MANT_BITS) + signr + mantr_hi;
/* round result to nearest or even; overflow to infinity possible */
r = r + ((mantr_lo == RND_BIT_MASK) ? (mantr_hi & MANT_LSB) : (mantr_lo >> RND_BIT_SHIFT));
} else if ((int32_t)expor > (MAX_NORM_EXPO - EXPO_ADJUST)) { /* overflow */
/* return infinity */
r = signr | FLOAT_INFINITY;
} else { /* underflow */
/* return zero, subnormal, or smallest normal */
shift = 0 - expor;
if (shift > MAX_SHIFT) shift = MAX_SHIFT;
/* denormalize significand */
mantr_lo = mantr_hi << (32 - shift) | (mantr_lo ? 1 : 0);
mantr_hi = mantr_hi >> shift;
/* combine sign and significand; biased exponent known to be zero */
r = mantr_hi + signr;
/* round result to nearest or even */
r = r + ((mantr_lo == RND_BIT_MASK) ? (mantr_hi & MANT_LSB) : (mantr_lo >> RND_BIT_SHIFT));
}
return r;
}
uint32_t float_as_uint (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float uint_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
float fp32_mul (float a, float b)
{
return uint_as_float (fp32_mul_core (float_as_uint (a), float_as_uint (b)));
}
/* Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007 */
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
#define ISNAN(x) ((float_as_uint (x) << 1) > 0xff000000)
#define QNAN(x) (x | FLOAT_QNAN_BIT)
#define PURELY_RANDOM (0)
#define PATTERN_BASED (1)
#define TEST_MODE (PURELY_RANDOM)
uint32_t v[8192];
int main (void)
{
unsigned long long count = 0;
float a, b, res, ref;
uint32_t i, j, patterns, idx = 0, nbrBits = sizeof (uint32_t) * CHAR_BIT;
/* pattern class 1: 2**i */
for (i = 0; i < nbrBits; i++) {
v [idx] = ((uint32_t)1 << i);
idx++;
}
/* pattern class 2: 2**i-1 */
for (i = 0; i < nbrBits; i++) {
v [idx] = (((uint32_t)1 << i) - 1);
idx++;
}
/* pattern class 3: 2**i+1 */
for (i = 0; i < nbrBits; i++) {
v [idx] = (((uint32_t)1 << i) + 1);
idx++;
}
/* pattern class 4: 2**i + 2**j */
for (i = 0; i < nbrBits; i++) {
for (j = 0; j < nbrBits; j++) {
v [idx] = (((uint32_t)1 << i) + ((uint32_t)1 << j));
idx++;
}
}
/* pattern class 5: 2**i - 2**j */
for (i = 0; i < nbrBits; i++) {
for (j = 0; j < nbrBits; j++) {
v [idx] = (((uint32_t)1 << i) - ((uint32_t)1 << j));
idx++;
}
}
/* pattern class 6: MAX_UINT/(2**i+1) rep. blocks of i zeros and i ones */
for (i = 0; i < nbrBits; i++) {
v [idx] = ((~(uint32_t)0) / (((uint32_t)1 << i) + 1));
idx++;
}
patterns = idx;
/* pattern class 6: one's complement of pattern classes 1 through 5 */
for (i = 0; i < patterns; i++) {
v [idx] = ~v [i];
idx++;
}
/* pattern class 7: two's complement of pattern classes 1 through 5 */
for (i = 0; i < patterns; i++) {
v [idx] = ~v [i] + 1;
idx++;
}
patterns = idx;
#if TEST_MODE == PURELY_RANDOM
printf ("using purely random test vectors\n");
#elif TEST_MODE == PATTERN_BASED
printf ("using pattern-based test vectors\n");
printf ("#patterns = %u\n", patterns);
#endif // TEST_MODE
do {
#if TEST_MODE == PURELY_RANDOM
a = uint_as_float (KISS);
b = uint_as_float (KISS);
#elif TEST_MODE == PATTERN_BASED
i = KISS % patterns;
j = KISS % patterns;
a = uint_as_float ((v[i] & 0x7fffff) | (KISS & ~0x7fffff));
b = uint_as_float ((v[j] & 0x7fffff) | (KISS & ~0x7fffff));
#endif // TEST_MODE
res = fp32_mul (a, b);
ref = a * b;
/* check for bit pattern mismatch between result and reference */
if (float_as_uint (res) != float_as_uint (ref)) {
/* if both a and b are NaNs, either could be returned quietened */
if (! (ISNAN (a) && ISNAN (b) &&
((QNAN (float_as_uint (a)) == float_as_uint (res)) ||
(QNAN (float_as_uint (b)) == float_as_uint (res))))) {
printf ("err: a=% 15.8e (%08x) b=% 15.8e (%08x) res=% 15.8e (%08x) ref=%15.8e (%08x)\n",
a, float_as_uint(a), b, float_as_uint (b), res, float_as_uint (res), ref, float_as_uint (ref));
return EXIT_FAILURE;
}
}
count++;
if (!(count & 0xffffff)) printf ("\r%llu", count);
} while (1);
return EXIT_SUCCESS;
}
It is much more complicated. Take a look at the source of the softfloat library (for example https://github.com/riscv/riscv-pk/blob/master/softfloat/f64_mul.c). Clone it and analyze.

Getting the exponent from a floating point in C

I'm writing a function that will get the exponent of a floating point number (IEEE 754 standard) but for some reason when I use the right shift bitwise operator on the number it returns 0
Here is the function
int get_exp (int x)
{
return ( ((x >> 21) & 255) -127 );
}
I'm passing it 7.23 so the output should be 2, for some reason the (x >> 21) part returns 0 when it should actually be returning 129. The 255 is the mask I'm using to and (&) with the exponent part of the floating point number.
I'm guessing you're doing some kind of casting hocus-pocus to pass floating point as ints? I would use float frexpf (float x, int* exp); as defined in <math.h>.
#include <math.h>
int get_exp(float x)
{
int exp;
frexpf(x, &exp);
return exp;
}
It's guaranteed to work regardless of the sizes of the floating point types.
If you want to roll it yourself, you can adapt this code.
#define EXPONENT_BIAS (-127)
int get_exp(float f)
{
int i;
union {
// Set here, then use s or c to extract
float f;
// This may or may not work for you
struct {
unsigned int sign: 1;
unsigned int exponent: 8;
unsigned int mantissa: 23;
} s;
// For debugging purposes
unsigned char c[sizeof(float)];
} u;
// Assign, you might need to reverse the bytes!
u.f = f;
// You'll probably need this to figure out the field widths
for (i = 0; i < sizeof(float); i++)
fprintf(stderr, "%02x%s", u.c[i], (i + 1 < sizeof(float))? " ": "\n");
// Just return the exponent
return (int)u.s.exponent + EXPONENT_BIAS;
}
This will bite you if sizeof(float) != 4, or if you switch endian-ness.
The main issue is passing int rather than float and using 21 vs 23. @dbush
The IEEE 754 standard (binary32) has a number of corner cases: infinity, NaN, and sub-normals including zero. So additional code is needed to cope with them.
Assuming proper endian:
int get_exp(float x) {
assert(sizeof x == sizeof(uint32_t));
union {
float x;
uint32_t u32;
} u = { x };
#define EXPOSHIFT 23
#define EXPOMASK 255
#define EXPOBIAS 127
if (x == 0.0) return 0;
int expo = (int) (u.u32 >> EXPOSHIFT) & EXPOMASK;
if (expo == EXPOMASK) return INT_MAX; // x is infinity or NaN
if (expo == 0) return get_exp(x * (1L << EXPOSHIFT)) - EXPOSHIFT;
return expo - EXPOBIAS;
}
Working under the assumption that a float is 32 bit and is laid out as specified here, you have three issues:
Your function needs to accept a float.
You need to point a uint32_t to the address of the float so it sees the same bytes, then perform actions against the dereferenced pointer.
The exponent starts at the 24th (23 if you start from 0) bit, not the 22nd (21 if you start with 0), so you have to shift by 23.
#include <stdio.h>
#include <stdint.h>
int get_exp (float x)
{
uint32_t *i = (uint32_t *)&x;
return ( ((*i >> 23) & 255) -127 );
}
int main()
{
printf("exp=%d\n",get_exp(7.23));
}
Result:
exp=2
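A variant of the same extraction that copies the bytes with memcpy instead of dereferencing a cast pointer, which sidesteps strict-aliasing concerns. This is a sketch under the same assumptions as above (32-bit IEEE-754 float, matching byte order); the name get_exp_memcpy is hypothetical, and zero, subnormals, infinity and NaN are not treated specially.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unbiased exponent of a normal binary32 value. */
int get_exp_memcpy(float x)
{
    uint32_t bits;

    assert(sizeof bits == sizeof x);   /* float must be 32 bits wide */
    memcpy(&bits, &x, sizeof bits);    /* copy the representation, no aliasing */
    return (int)((bits >> 23) & 0xFF) - 127;
}
```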
Should performance not be an issue, simply iterate:
int expof(float f) {
int expo = 0;
if (f < 0.0) f = -f;
if (f == 0.0f || f * 0.5f == f) return 0; // zero and infinity would otherwise loop forever
while (f < 0.5f) {
f *= 2.0f;
expo--;
}
while (f >= 1.0f) {
f *= 0.5f;
expo++;
}
return expo;
}
Does not depend on any particular float implementation other than that the exponent fits in int. It uses no external functions, as commented here.
Same result as from int expo; frexpf(f, &expo); return expo
The parameter list shows
int x
and you pass a floating point number. Try to substitute with
float x

Floating point emulation or Fixed Point for numbers in a given range

I have a co-processor which does not have floating point support. I tried to use 32-bit fixed point, but it is unable to work on very small numbers. My numbers range from 1 to 1e-18. One way is to use floating point emulation, but it is too slow. Can we make it faster in this case, where we know the numbers won't be greater than 1 or smaller than 1e-18? Or is there a way to make fixed point work on very small numbers?
It is not possible for a 32-bit fixed-point encoding to represent numbers from 10^-18 to 1. This is immediately obvious from the fact that the span from 10^-18 to 1 is a ratio of 10^18, but the non-zero encodings of a 32-bit integer span a ratio of less than 2^32, which is much less than 10^18. Therefore, no choice of scale for the fixed-point encoding will provide the desired span.
So a 32-bit fixed-point encoding will not work, and you must use some other technique.
In some applications, it may be suitable to use multiple fixed-point encodings. That is, various input values would be encoded with a fixed-point encoding but each with a scale suitable to it, and intermediate values and the outputs would also have customized scales. Obviously, this is possible only if suitable scales can be determined at design time. Otherwise, you should abandon 32-bit fixed-point encodings and consider alternatives.
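As a rough check of the span argument above (a back-of-the-envelope calculation, not a statement about any particular format), the number of bits needed just to cover the ratio between the largest and smallest value is:

```latex
\[
\log_2\!\left(10^{18}\right) = 18 \log_2 10 \approx 18 \times 3.3219 \approx 59.8
\]
```

So a single fixed-point scale needs about 60 bits before any bits are left over for precision at the small end, which is far beyond 32.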
Will a simplified 24-bit floating point be fast enough and accurate enough?
#include <stdio.h>
#include <limits.h>
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned myfloat;
#else
typedef unsigned long myfloat;
#endif
#define MF_EXP_BIAS 0x80
myfloat mfadd(myfloat a, myfloat b)
{
unsigned ea = a >> 16, eb = b >> 16;
if (ea > eb)
{
a &= 0xFFFF;
b = (b & 0xFFFF) >> (ea - eb);
if ((a += b) > 0xFFFF)
a >>= 1, ++ea;
return a | ((myfloat)ea << 16);
}
else if (eb > ea)
{
b &= 0xFFFF;
a = (a & 0xFFFF) >> (eb - ea);
if ((b += a) > 0xFFFF)
b >>= 1, ++eb;
return b | ((myfloat)eb << 16);
}
else
{
return (((a & 0xFFFF) + (b & 0xFFFF)) >> 1) | ((myfloat)++ea << 16);
}
}
myfloat mfmul(myfloat a, myfloat b)
{
unsigned ea = a >> 16, eb = b >> 16, e = ea + eb - MF_EXP_BIAS;
myfloat p = ((a & 0xFFFF) * (b & 0xFFFF)) >> 16;
return p | ((myfloat)e << 16);
}
myfloat double2mf(double x)
{
myfloat f;
unsigned e = MF_EXP_BIAS + 16;
if (x <= 0)
return 0;
while (x < 0x8000)
x *= 2, --e;
while (x >= 0x10000)
x /= 2, ++e;
f = x;
return f | ((myfloat)e << 16);
}
double mf2double(myfloat f)
{
double x;
unsigned e = (f >> 16) - 16;
if ((f & 0xFFFF) == 0)
return 0;
x = f & 0xFFFF;
while (e > MF_EXP_BIAS)
x *= 2, --e;
while (e < MF_EXP_BIAS)
x /= 2, ++e;
return x;
}
int main(void)
{
double testConvData[] = { 1e-18, .25, 0.3333333, .5, 1, 2, 3.141593, 1e18 };
unsigned i;
for (i = 0; i < sizeof(testConvData) / sizeof(testConvData[0]); i++)
printf("%e -> 0x%06lX -> %e\n",
testConvData[i],
(unsigned long)double2mf(testConvData[i]),
mf2double(double2mf(testConvData[i])));
printf("300 * 5 = %e\n", mf2double(mfmul(double2mf(300),double2mf(5))));
printf("500 + 3 = %e\n", mf2double(mfadd(double2mf(500),double2mf(3))));
printf("1e18 * 1e-18 = %e\n", mf2double(mfmul(double2mf(1e18),double2mf(1e-18))));
printf("1e-18 + 2e-18 = %e\n", mf2double(mfadd(double2mf(1e-18),double2mf(2e-18))));
printf("1e-16 + 1e-18 = %e\n", mf2double(mfadd(double2mf(1e-16),double2mf(1e-18))));
return 0;
}
Output (ideone):
1.000000e-18 -> 0x459392 -> 9.999753e-19
2.500000e-01 -> 0x7F8000 -> 2.500000e-01
3.333333e-01 -> 0x7FAAAA -> 3.333282e-01
5.000000e-01 -> 0x808000 -> 5.000000e-01
1.000000e+00 -> 0x818000 -> 1.000000e+00
2.000000e+00 -> 0x828000 -> 2.000000e+00
3.141593e+00 -> 0x82C90F -> 3.141541e+00
1.000000e+18 -> 0xBCDE0B -> 9.999926e+17
300 * 5 = 1.500000e+03
500 + 3 = 5.030000e+02
1e18 * 1e-18 = 9.999390e-01
1e-18 + 2e-18 = 2.999926e-18
1e-16 + 1e-18 = 1.009985e-16
Subtraction is left as an exercise. Ditto for better conversion routines.
Use 64 bit fixed point and be done with it.
Compared with 32 bit fixed point it will be four times slower for multiplication, but it will still be far more efficient than float emulation.
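A minimal sketch of what such a 64-bit fixed-point multiply could look like when only 32x32 -> 64-bit multiplies are cheap. The format (1 integer bit, 63 fraction bits) and the names uq1_63, UQ1_63_ONE and uq1_63_mul are assumptions made for this illustration; the result is truncated, not rounded.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t uq1_63;                 /* hypothetical: value = raw / 2^63 */
#define UQ1_63_ONE ((uint64_t)1 << 63)   /* representation of 1.0 */

/* (a * b) >> 63 built from four 32x32 -> 64-bit partial products. */
uq1_63 uq1_63_mul(uq1_63 a, uq1_63 b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;
    /* middle column of the schoolbook product, including its carries */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
    uint64_t lo = (mid << 32) | (uint32_t)ll;                  /* bits 0..63 */
    uint64_t hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);  /* bits 64..127 */

    assert(a <= UQ1_63_ONE && b <= UQ1_63_ONE);  /* inputs in [0, 1] */
    return (hi << 1) | (lo >> 63);               /* drop 63 fraction bits */
}
```

The four partial products match the "four times slower" estimate above. Note that 1e-18 is only about 9 units in the last place in this format, so very small values keep few significant bits.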
In embedded systems I'd suggest using a 16+32, 16+16, 8+16 or 8+24 bit redundant floating point representation, where each number is simply M * 2^exp.
In this case you can choose to represent zero with both M=0 and exp=0. There are 16-32 representations for each power of 2, which mainly makes comparison a bit harder than usual. One can also postpone normalization, e.g. until after a subtraction.
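One way such a representation might look, as a rough sketch: a 16+16-bit pair where the mantissa is never forced into a normalized form. The type rfloat and the function rf_mul are hypothetical names, and the choice to shift only when the product overflows 16 bits is an assumption about how normalization could be postponed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16+16-bit format: value = m * 2^e, m is not kept normalized. */
typedef struct { int16_t m; int16_t e; } rfloat;

rfloat rf_mul(rfloat a, rfloat b)
{
    int32_t p = (int32_t)a.m * b.m;   /* exact 32-bit product */
    int32_t e = (int32_t)a.e + b.e;   /* exponents simply add */
    rfloat r;

    /* Drop low bits only when the product does not fit in 16 bits. */
    while (p > INT16_MAX || p < INT16_MIN) {
        p /= 2;                       /* truncates toward zero */
        e++;
    }
    assert(e >= INT16_MIN && e <= INT16_MAX);
    r.m = (int16_t)p;
    r.e = (int16_t)e;
    return r;
}
```

Because many (m, e) pairs encode the same value, comparing two results requires aligning their exponents first, which is the extra cost mentioned above.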

How to manually (bitwise) perform (float)x?

Now, here is the function header of the function I'm supposed to implement:
/*
* float_from_int - Return bit-level equivalent of expression (float) x
* Result is returned as unsigned int, but
* it is to be interpreted as the bit-level representation of a
* single-precision floating point values.
* Legal ops: Any integer/unsigned operations incl. ||, &&. also if, while
* Max ops: 30
* Rating: 4
*/
unsigned float_from_int(int x) {
...
}
We aren't allowed to do float operations, or any kind of casting.
Now I tried to implement the first algorithm given at this site: http://locklessinc.com/articles/i2f/
Here's my code:
unsigned float_from_int(int x) {
// grab sign bit
int xIsNegative = 0;
int absValOfX = x;
if(x < 0){
xIsNegative = 1;
absValOfX = -x;
}
// zero case
if(x == 0){
return 0;
}
if(x == 0x80000000){ //Updated to add this
return 0xcf000000;
}
//int shiftsNeeded = 0;
/*while(){
shiftsNeeded++;
}*/
unsigned I2F_MAX_BITS = 15;
unsigned I2F_MAX_INPUT = ((1 << I2F_MAX_BITS) - 1);
unsigned I2F_SHIFT = (24 - I2F_MAX_BITS);
unsigned result, i, exponent, fraction;
if ((absValOfX & I2F_MAX_INPUT) == 0)
result = 0;
else {
exponent = 126 + I2F_MAX_BITS;
fraction = (absValOfX & I2F_MAX_INPUT) << I2F_SHIFT;
i = 0;
while(i < I2F_MAX_BITS) {
if (fraction & 0x800000)
break;
else {
fraction = fraction << 1;
exponent = exponent - 1;
}
i++;
}
result = (xIsNegative << 31) | exponent << 23 | (fraction & 0x7fffff);
}
return result;
}
But it didn't work (see test error below):
ERROR: Test float_from_int(8388608[0x800000]) failed...
...Gives 0[0x0]. Should be 1258291200[0x4b000000]
I don't know where to go from here. How should I go about parsing the float from this int?
EDIT #1:
You might be able to see from my code that I also started working on this algorithm (see this site):
I assumed 10-bit, 2’s complement, integers since the mantissa is only
9 bits, but the process generalizes to more bits.
1. Save the sign bit of the input and take the absolute value of the input.
2. Shift the input left until the high order bit is set and count the number of shifts required. This forms the floating mantissa.
3. Form the floating exponent by subtracting the number of shifts from step 2 from the constant 137 or (0h89-(#of shifts)).
4. Assemble the float from the sign, mantissa, and exponent.
But, that doesn't seem right. How could I convert 0x80000000? Doesn't make sense.
EDIT #2:
I think it's because I say max bits is 15... hmmm...
EDIT #3: Screw that old algorithm, I'm starting over:
unsigned float_from_int(int x) {
// grab sign bit
int xIsNegative = 0;
int absValOfX = x;
if(x < 0){
xIsNegative = 1;
absValOfX = -x;
}
// zero case
if(x == 0){
return 0;
}
if (x == 0x80000000){
return 0xcf000000;
}
int shiftsNeeded = 0;
int counter = 0;
while(((absValOfX >> counter) & 1) != 1 && shiftsNeeded < 32){
counter++;
shiftsNeeded++;
}
unsigned exponent = shiftsNeeded + 127;
unsigned result = (xIsNegative << 31) | (exponent << 23);
return result;
Here's the error I get on this one (I think I got past the last error):
ERROR: Test float_from_int(-2139095040[0x80800000]) failed...
...Gives -889192448[0xcb000000]. Should be -822149120[0xceff0000]
May be helpful to know that:
absValOfX = 7f800000
(using printf)
EDIT #4: Ah, I'm finding the exponent wrong, need to count from the left, then subtract from 32 I believe.
EDIT #5: I started over, now trying to deal with weird rounding problems...
if (x == 0){
return 0; // 0 is a special case because it has no 1 bits
}
if (x >= 0x80000000 && x <= 0x80000040){
return 0xcf000000;
}
// Save the sign bit of the input and take the absolute value of the input.
unsigned signBit = 0;
unsigned absX = (unsigned)x;
if (x < 0)
{
signBit = 0x80000000u;
absX = (unsigned)-x;
}
// Shift the input left until the high order bit is set to form the mantissa.
// Form the floating exponent by subtracting the number of shifts from 158.
unsigned exponent = 158;
while ((absX & 0x80000000) == 0)
{
exponent--;
absX <<= 1;
}
unsigned negativeRoundUp = (absX >> 7) & 1 & (absX >> 8);
// compute mantissa
unsigned mantissa = (absX >> 8) + ((negativeRoundUp) || (!signBit & (absX >> 7) & (exponent < 156)));
printf("absX = %x, absX >> 8 = %x, exponent = %i, mantissa = %x\n", absX, (absX >> 8), exponent, mantissa);
// Assemble the float from the sign, mantissa, and exponent.
return signBit | ((exponent << 23) + (signBit & negativeRoundUp)) | ( (mantissa) & 0x7fffff);
absX = fe000084, absX >> 8 = fe0000, exponent = 156, mantissa = fe0000
ERROR: Test float_from_int(1065353249[0x3f800021]) failed...
...Gives 1316880384[0x4e7e0000]. Should be 1316880385[0x4e7e0001]
EDIT #6
Did it again, still, the rounding doesn't work properly. I've tried to hack together some rounding, but it just won't work...
unsigned float_from_int(int x) {
/*
If N is negative, negate it in two's complement. Set the high bit (2^31) of the result.
If N < 2^23, left shift it (multiply by 2) until it is greater or equal to.
If N ≥ 2^24, right shift it (unsigned divide by 2) until it is less.
Bitwise AND with ~2^23 (one's complement).
If it was less, subtract the number of left shifts from 150 (127+23).
If it was more, add the number of right shifts to 150.
This new number is the exponent. Left shift it by 23 and add it to the number from step 3.
*/
printf("---------------\n");
//printf("x = %i (%x), -x = %i, (%x)\n", x, x, -x, -x);
if(x == 0){
return 0;
}
if(x == 0x80000000){
return 0xcf000000;
}
// If N is negative, negate it in two's complement. Set the high bit of the result
unsigned signBit = 0;
if (x < 0){
signBit = 0x80000000;
x = -x;
}
printf("abs val of x = %i (%x)\n", x, x);
int roundTowardsZero = 0;
int lastDigitLeaving = 0;
int shiftAmount = 0;
int originalAbsX = x;
// If N < 2^23, left shift it (multiply it by 2) until it is greater than or equal to 2^23.
if(x < (8388608)){
while(x < (8388608)){
//printf(" minus shift and x = %i", x );
x = x << 1;
shiftAmount--;
}
} // If N >= 2^24, right shift it (unsigned divide by 2) until it is less.
else if(x >= (16777215)){
while(x >= (16777215)){
/*if(x & 1){
roundTowardsZero = 1;
printf("zzz Got here ---");
}*/
lastDigitLeaving = (x >> 1) & 1;
//printf(" plus shift and x = %i", x);
x = x >> 1;
shiftAmount++;
}
//Round towards zero
x = (x + (lastDigitLeaving && (!(originalAbsX > 16777216) || signBit)));
printf("x = %i\n", x);
//shiftAmount = shiftAmount + roundTowardsZero;
}
printf("roundTowardsZero = %i, shiftAmount = %i (%x)\n", roundTowardsZero, shiftAmount, shiftAmount);
// Bitwise AND with 0x7fffff
x = x & 0x7fffff;
unsigned exponent = 150 + shiftAmount;
unsigned rightPlaceExponent = exponent << 23;
printf("exponent = %i, rightPlaceExponent = %x\n", exponent, rightPlaceExponent);
unsigned result = signBit | rightPlaceExponent | x;
return result;
The problem is that the lowest int is -2147483648, but the highest is 2147483647, so there is no absolute value of -2147483648. While you could work around it, I would just make a special case for that one bit pattern (like you do for 0):
if (x == 0)
return 0;
if (x == -2147483648)
return 0xcf000000;
The other problem is that you copied an algorithm that only works for numbers from 0 to 32767. Further down in the article they explain how to expand it to all ints, but it uses operations that you're likely not allowed to use.
I would recommend writing it from scratch based on the algorithm mentioned in your edit. Here's a version in C# that rounds towards 0:
uint float_from_int(int x)
{
if (x == 0)
return 0; // 0 is a special case because it has no 1 bits
// Save the sign bit of the input and take the absolute value of the input.
uint signBit = 0;
uint absX = (uint)x;
if (x < 0)
{
signBit = 0x80000000u;
absX = (uint)-x;
}
// Shift the input left until the high order bit is set to form the mantissa.
// Form the floating exponent by subtracting the number of shifts from 158.
uint exponent = 158;
while ((absX & 0x80000000) == 0)
{
exponent--;
absX <<= 1;
}
// compute mantissa
uint mantissa = absX >> 8;
// Assemble the float from the sign, mantissa, and exponent.
return signBit | (exponent << 23) | (mantissa & 0x7fffff);
}
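The version above truncates, but the failing tests in the question (for example 0x3f800021, which should give 0x4e7e0001) expect the default IEEE behaviour of (float)x: round to nearest, ties to even. Below is a hedged C sketch of the same normalize-and-pack approach with that rounding added. The name float_from_int_rne is made up for illustration, a 32-bit two's-complement int is assumed, and the code uses casts and fixed-width types, so it is not tuned to the assignment's no-cast rule or its 30-operation limit.

```c
#include <stdint.h>

/* Sketch: bit pattern of (float)x, rounding to nearest with ties to even.
   Assumes 32-bit two's-complement integers and IEEE 754 single precision. */
uint32_t float_from_int_rne(int32_t x)
{
    uint32_t sign = 0;
    uint32_t absx = (uint32_t)x;
    uint32_t exponent = 158;   /* 127 + 31: biased exponent when bit 31 is the leading one */
    uint32_t frac, rest, round_up;

    if (x == 0)
        return 0;              /* zero has no leading one bit */
    if (x < 0) {
        sign = 0x80000000u;
        absx = 0u - absx;      /* unsigned negate, also valid for the most negative int */
    }

    /* Normalize: move the leading one up to bit 31. */
    while ((absx & 0x80000000u) == 0) {
        absx <<= 1;
        exponent--;
    }

    frac = (absx >> 8) & 0x7FFFFFu;   /* the 23 stored fraction bits */
    rest = absx & 0xFFu;              /* the 8 bits that do not fit */

    /* Round up above the halfway point, or exactly at it when the fraction is odd. */
    round_up = (rest > 0x80u) || (rest == 0x80u && (frac & 1u));

    /* Adding instead of OR-ing lets a fraction overflow carry into the exponent. */
    return sign | ((exponent << 23) + frac + round_up);
}
```

With this rounding, the three inputs reported as failing in the question (0x800000, 0x80800000 and 0x3f800021) produce the expected 0x4b000000, 0xceff0000 and 0x4e7e0001.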
The basic formulation of the algorithm is to determine the sign, exponent and mantissa bits, then pack the result into an integer. Breaking it down this way makes it easy to clearly separate the tasks in code and makes solving the problem (and testing your algorithm) much easier.
The sign bit is the easiest, and getting rid of it makes finding the exponent easier. You can distinguish four cases: 0, 0x80000000, [-0x7fffffff, -1], and [1, 0x7fffffff]. The first two are special cases, and you can trivially get the sign bit in the last two cases (and the absolute value of the input). If you're going to cast to unsigned, you can get away with not special-casing 0x80000000 as I mentioned in a comment.
Next up, find the exponent -- there's an easy (and costly) looping way, and a trickier but faster way to do this. My absolute favourite page for this is Sean Anderson's bit hacks page. One of the algorithms shows a very quick loop-less way to find the log2 of an integer in only seven operations.
Once you know the exponent, then finding the mantissa is easy. You just drop the leading one bit, then shift the result either left or right depending on the exponent's value.
If you use the fast log2 algorithm, you can probably end up with an algorithm which uses no more than 20 operations.
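The sign / exponent / mantissa split described above might be sketched as follows. The helper highest_bit is a branch-based binary search, which is only one of several loop-free log2 variants and not necessarily the one the answer has in mind. Both function names are hypothetical, a 32-bit int is assumed, and the result is truncated (rounded toward zero) rather than rounded to nearest.

```c
#include <stdint.h>

/* Hypothetical helper: index of the highest set bit, i.e. floor(log2(v)), for v != 0.
   Binary search over bit positions, no loop. */
static unsigned highest_bit(uint32_t v)
{
    unsigned r = 0;
    if (v & 0xFFFF0000u) { v >>= 16; r |= 16; }
    if (v & 0x0000FF00u) { v >>= 8;  r |= 8;  }
    if (v & 0x000000F0u) { v >>= 4;  r |= 4;  }
    if (v & 0x0000000Cu) { v >>= 2;  r |= 2;  }
    if (v & 0x00000002u) {           r |= 1;  }
    return r;
}

/* Sketch: bit pattern of x as a float, truncating any bits that do not fit. */
uint32_t float_bits_truncating(int32_t x)
{
    uint32_t sign = 0;
    uint32_t absx = (uint32_t)x;
    uint32_t frac;
    unsigned e;

    if (x == 0)
        return 0;
    if (x < 0) {
        sign = 0x80000000u;
        absx = 0u - absx;              /* unsigned negate */
    }

    e = highest_bit(absx);             /* unbiased exponent */
    absx ^= (uint32_t)1 << e;          /* drop the leading one bit */

    /* Align the remaining bits with the 23-bit fraction field. */
    frac = (e > 23) ? (absx >> (e - 23)) : (absx << (23 - e));

    return sign | ((uint32_t)(e + 127) << 23) | frac;
}
```

For inputs below 2^24 no bits are lost, so the result matches (float)x exactly; above that it can differ from the rounded result by one unit in the last place.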
Dealing with 0x80000000 is pretty easy:
int xIsNegative = 0;
unsigned int absValOfX = x;
if (x < 0)
{
xIsNegative = 1;
absValOfX = -(unsigned int)x;
}
This gets rid of special-casing -2147483648: negating the unsigned value is well defined and gives 2147483648, which is representable as an unsigned int (assuming 32-bit int), so absValOfX always holds the magnitude of x.

Resources