Round to IEEE 754 precision but keep binary format


If I convert the decimal number 3120.0005 to float (32-bit) representation, the number gets rounded down to 3120.00048828125.
Assuming we're using a fixed point number with a scale of 10^12 then 1000000000000 = 1.0 and 3120000500000000 = 3120.0005.
What would the formula/algorithm be to round down to the nearest IEEE 754 precision to get 3120000488281250?
I would also need a way to get the result of rounding up (3120000732421875).

If you divide by the decimal scaling factor and convert the result to float, you get the nearest representable float. To step to the neighbouring representable value in the other direction, nextafterf (declared in <math.h>) can be used:
#include <float.h>
#include <math.h>
#include <stdio.h>
long long scale_to_fixed(float f)
{
float intf = truncf(f);
long long result = 1000000000000LL;
result *= (long long)intf;
result += round((f - intf) * 1.0e12);
return result;
}
/* not needed, always good enough to use (float)(n / 1.0e12) */
float scale_from_fixed(long long n)
{
float result = (n % 1000000000000LL) / 1.0e12;
result += n / 1000000000000LL;
return result;
}
int main()
{
long long x = 3120000500000000;
float x_reduced = scale_from_fixed(x);
long long y1 = scale_to_fixed(x_reduced);
long long yfloor = y1, yceil = y1;
if (y1 < x) {
yceil = scale_to_fixed(nextafterf(x_reduced, FLT_MAX));
}
else if (y1 > x) {
yfloor = scale_to_fixed(nextafterf(x_reduced, -FLT_MAX));
}
printf("%lld\n%lld\n%lld\n", yfloor, x, yceil);
}
Results:
3120000488281250
3120000500000000
3120000732421875

To handle the value as a float scaled by 1e12 and get the result of "rounding up (3120000732421875)", the key is understanding that you are looking for the next representable float above the 32-bit representation of x / 1.0e12 (not a power of two). For a positive, finite float, that neighbour is the value whose stored bit pattern, read as an unsigned integer, is larger by 1. While you can arrive at this value mathematically, a union between float and unsigned (or uint32_t) provides a direct way to interpret the stored 32-bit value of the floating-point number as an unsigned value (see footnote 1).
A simple example using the union prev to hold the reduced value of x, and a separate instance next holding the unsigned value (+1), can be written as follows. It relies on the conversion to float having rounded down, as it does for this x:
#include <stdio.h>
#include <inttypes.h>
int main (void) {
uint64_t x = 3120000500000000;
union { /* union between float and uint32_t */
float f;
uint32_t u;
} prev = { .f = x / 1.0e12 }, /* x reduced to float, bit pattern readable as .u */
next = { .u = prev.u + 1u }; /* 2nd union, bit pattern + 1 = next float up */
printf ("prev : %" PRIu64 "\n x : %" PRIu64 "\nnext : %" PRIu64 "\n",
(uint64_t)(prev.f * 1e12), x, (uint64_t)(next.f * 1e12));
}
Example Use/Output
$ ./bin/pwr2_prev_next
prev : 3120000488281250
x : 3120000500000000
next : 3120000732421875
Footnotes:
1. As an alternative, you can use a pointer to char to hold the address of the floating point type and interpret the 4-byte value stored at that location as unsigned without running afoul of C11 Standard - §6.5 Expressions (p6,7) (the "Strict Aliasing Rule"), but the use of a union is preferred.
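The footnote's alternative can also be written with memcpy, which copies the stored bytes without any aliasing concerns. The following is only a sketch: the helper name next_up_positive is made up for illustration, and it assumes float is IEEE 754 binary32 and that the argument is positive and finite.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the next representable float above f.
   Assumes IEEE 754 binary32 and a positive, finite f. */
static float next_up_positive(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* read the stored 32-bit pattern */
    bits += 1u;                      /* one step up in the float ordering */
    memcpy(&f, &bits, sizeof f);     /* write the pattern back as a float */
    return f;
}
```

For the value in the question, next_up_positive(3120.0005f) yields 3120.000732421875f, the same neighbour that nextafterf finds.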

Related

Problem with float and int multiplication in C [duplicate]

I'm using the online compiler https://www.onlinegdb.com/ and in the following code when I multiply 2.1 with 100 the output becomes 209 instead of 210.
#include<stdio.h>
#include <stdint.h>
int main()
{
float x = 1.8;
x = x + 0.3;
int coefficient = 100;
printf("x: %2f\n", x);
uint16_t y = (uint16_t)(x * coefficient);
printf("y: %d\n", y);
return 0;
}
What am I doing wrong? And what should I do to obtain 210?
I tried all sorts of type casts, but it still doesn't work.

The following assumes the compiler uses IEEE-754 binary32 and binary64 for float and double, which is overwhelmingly common.
float x = 1.8;
Since 1.8 is a double constant, the compiler converts 1.8 to the nearest double value, 1.8000000000000000444089209850062616169452667236328125. Then, to assign it to the float x, it converts that to the nearest float value, 1.7999999523162841796875.
x = x + 0.3;
The compiler converts 0.3 to the nearest double value, 0.299999999999999988897769753748434595763683319091796875. Then it adds x and that value using double arithmetic, which produces 2.09999995231628400205181605997495353221893310546875.
Then, to assign that to x, it converts it to the nearest float value, 2.099999904632568359375.
uint16_t y = (uint16_t)(x * coefficient);
Since x is float and coefficient is int, the compiler converts the coefficient to float and performs the multiplication using float arithmetic. This produces 209.9999847412109375.
Then the conversion to uint16_t truncates the number, producing 209.
One way to get 210 instead is to use uint16_t y = lroundf(x * coefficient);. (lroundf is declared in <math.h>.) However, to determine what the right way is, you should explain what these numbers are and why you are doing this arithmetic with them.

Floating-point numbers are not exact. When you add 1.8 + 0.3,
the FPU may generate a result slightly different from the expected 2.1 (by a margin smaller than the float epsilon).
Read more about floating-point representation on Wikipedia: https://en.wikipedia.org/wiki/Machine_epsilon
What happens in your case is:
(1.8 + 0.3) * 100 = 209.99999...
Then you truncate it to an integer, resulting in 209.
You might also find this question relevant: Why float.Epsilon and not zero?

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h> /* for round() */
int main()
{
float x = 1.8;
x = x + 0.3;
uint16_t coefficient = 100;
printf("x: %2f\n", x);
uint16_t y = round(x * coefficient);
printf("y: %" PRIu16 "\n", y);
return 0;
}

Print float without using printf() in c

I am trying to code my printf() function. I wanted to print float/double values. This is what I managed to do so far.
static void ft_float(va_list *ap, t_flag *flags)
{
double myfloat;
signed long int decipart;
signed long int intpart;
myfloat = va_arg(*ap, double);
if (myfloat < 0)
{
ft_myputchar('-');
myfloat *= -1;
}
intpart = (signed long int)myfloat;
ft_putnbr(intpart);
ft_myputchar('.');
myfloat -= intpart;
myfloat *= 1000000; //upto 6 decimal points
decipart = (signed long int)(myfloat + 0.5); //+0.5 to round of the value
ft_putnbr(decipart);
}
As you can see, the code works well for floats like 1.424352 or 12313.1341414, but not when the fractional part has leading zeros, for example 1.004243 or 12313.0001341.

printf can pad a value with zeroes: the 0 flag sets the padding character to 0, and the width specifies the minimum number of characters written, with padding added on the left if necessary.
Simply print decipart with the format "%06ld".
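Since the goal in the question is to avoid printf itself, the same zero padding can be produced by hand. This is a minimal sketch; format_padded is a hypothetical helper, not part of the asker's code, and it assumes the value has at most width digits (extra high digits are dropped).

```c
/* Hypothetical helper: writes value as exactly `width` decimal digits
   into buf, padding on the left with '0'. buf must have room for
   width + 1 characters. Digits beyond `width` are silently dropped. */
static void format_padded(unsigned long value, int width, char *buf)
{
    buf[width] = '\0';
    for (int i = width - 1; i >= 0; --i) {
        buf[i] = (char)('0' + value % 10);  /* fill from the last digit backwards */
        value /= 10;
    }
}
```

Calling format_padded(4243, 6, buf) fills buf with "004243", which can then be written out character by character with the asker's own output routine.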

Converting SIGNED fractions to UNSIGNED fixed point for addition and multiplication

How can we convert floating point numbers to their "fixed-point representations", and use those representations in fixed-point operations such as addition and multiplication? The result of the fixed-point operation must yield the correct answer when converted back to floating point.
Say:
(double)(xa_double) + (double)(xb_double) = ?
Then we convert both addends to a fixed-point representation (integer),
(int)(xa_fixed) + (int)(xb_fixed) = (int)(xsum_fixed)
To get (double)(xsum_double), we convert (int)(xsum_fixed) back to floating point, which should yield the same answer,
FixedToDouble(xsum_fixed) => xsum_double
Specifically, if the range of the values of xa_double and xb_double is between -1.65 and 1.65, I want to convert xa_double and xb_double in their respective 10-bit fixed point representations (0x0000 to 0x03FF)
WHAT I HAVE TRIED
int fixed_MAX = 1023;
int fixed_MIN = 0;
double Value_MAX = 1.65;
double Value_MIN = -1.65;
double slope = ((fixed_MAX) - (fixed_MIN))/((Value_MAX) - (Value_MIN));
int DoubleToFixed(double x)
{
return round(((x) - Value_MIN)*slope + fixed_MIN); //via interpolation method
}
double FixedToDouble(int x)
{
return (double)((((x) + fixed_MIN)/slope) + Value_MIN);
}
int sum_fixed(int x, int y)
{
return (x + y - (1.65*slope)); //analysis, just basic math
}
int subtract_fixed(int x, int y)
{
return (x - y + (1.65*slope));
}
int product_fixed(int x, int y)
{
return (((x * y) - (slope*slope*((1.65*FixedToDouble(x)) + (1.65*FixedToDouble(y)) + (1.65*1.65))) + (slope*slope*1.65)) / slope);
}
And if I want to add (double)(1.00) + (double)(2.00) = which should yield to (double)(3.00),
With my code,
xsum_fixed = DoubleToFixed(1.00) + DoubleToFixed(2.00);
xsum_double = FixedToDouble(xsum_fixed);
I get the answer:
xsum_double = 3.001613
Which is very close to the correct answer (double)(3.00)
Also, if I perform multiplication and subtraction I get 2.004839 and -1.001613, respectively.
HERE'S THE CATCH:
So I know my code is working, but how can I perform addition, multiplication and subtraction on these fixed-point representations without having INTERNAL FLOATING POINT OPERATIONS AND NUMBERS?
So in the code above, the functions sum_fixed, product_fixed, and subtract_fixed have internal floating point numbers (slope and 1.65, 1.65 being the MAX float input). I derived my code by basic math, really.
So I want to implement add, subtract, and product functions without any internal floating point operations or numbers.
UPDATE:
I also found a simpler code in converting fractional numbers to fixed-point:
//const int scale = 16; //1/2^16 in 32 bits
#define DoubleToFixed(x) (int)((x) * (double)(1<<scale))
#define FixedToDouble(x) ((double)(x) / (double)(1<<scale))
#define FractionPart(x) ((x) & FractionMask)
#define MUL(x,y) (((long long)(x)*(long long)(y)) >> scale)
#define DIV(x, y) (((long long)(x)<<16)/(y))
However, this converts only UNSIGNED fractions to UNSIGNED fixed-point, and I want to convert SIGNED fractions (-1.65 to 1.65) to UNSIGNED fixed-point (0x0000 to 0x03FF). How can I do this with the code above? Does the range or the number of bits have something to do with the conversion process? Is this code only for positive fractions?
Credits to @chux.

You can make the mantissa of the floating-point representation of your number equal to its fixed-point representation. Since FP addition shifts the smaller operand's mantissa until both operands have the same exponent, you can add a certain 'magic number' to force that shift. For double, it's 1<<(52-precision) (52 is the size of double's mantissa, 'precision' is the required number of binary fraction digits). So the conversion would look like this:
union { double f; long long i; } u = { xfloat + (double)(1ll << (52 - precision)) }; // shift x's mantissa
long long xfixed = u.i & ((1ll << 52) - 1); // extract the mantissa
After that you can use xfixed in integer math (for multiplication, you have to shift the result right by 'precision'). To convert it back to double, simply multiply it by 1.0/(1 << precision).
Note that this doesn't handle negatives. If you need them, you have to convert them to the complementary representation manually (first take fabs of the double, then negate the int result if the input was negative).
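The magic-number conversion can be tried out with a small sketch. The function names are made up for illustration, memcpy stands in for the union, and the code assumes IEEE 754 binary64 doubles, the default round-to-nearest mode, and 0 <= x < 2^(52 - precision).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fixed-point value of x with `precision` fraction bits.
   Assumes IEEE 754 binary64 and 0 <= x < 2^(52 - precision). */
static int64_t double_to_fixed_magic(double x, int precision)
{
    /* Adding 2^(52 - precision) forces the exponent so that the lowest
       mantissa bit has weight 2^(-precision). */
    double shifted = x + (double)(1LL << (52 - precision));
    int64_t bits;
    memcpy(&bits, &shifted, sizeof bits);  /* read the stored bit pattern */
    return bits & ((1LL << 52) - 1);       /* keep only the 52 mantissa bits */
}

/* Hypothetical helper: converts the fixed-point value back to double. */
static double fixed_to_double(int64_t fixed, int precision)
{
    return (double)fixed / (double)(1LL << precision);
}
```

With precision 10, the value 0.75 becomes 768 (0.75 * 1024), and converting 768 back gives 0.75 again.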

function to map a double into a long number

This may seem like an unusual question, but I would like to find a function that transforms a double (C number) into a long (C number). It's not necessary to preserve the double's information. The most important thing is:
double a,b;
long c,d;
c = f(a);
d = f(b);
This must hold:
if (a < b) then c < d, for all doubles a, b (with c = f(a) and d = f(b))
Thank you to all of you.

Your requirement is feasible if the following two conditions hold:
The compiler defines sizeof(double) the same as sizeof(long)
The hardware uses IEEE 754 double-precision binary floating-point format
While the 2nd condition holds on every widely-used platform, the 1st condition does not.
If both conditions do hold on your platform, then you can implement the function as follows:
long f(double x)
{
if (x > 0)
return double_to_long(x);
if (x < 0)
return -double_to_long(-x);
return 0;
}
You have several different ways to implement the conversion function (the first one needs #include <string.h> for memcpy):
long double_to_long(double x)
{
long y;
memcpy(&y,&x,sizeof(x));
return y;
}
long double_to_long(double x)
{
long y;
y = *(long*)&x;
return y;
}
long double_to_long(double x)
{
union
{
double x;
long y;
}
u;
u.x = x;
return u.y;
}
Please note that the second option is not recommended, because it breaks strict-aliasing rule.
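Putting the pieces together, the approach can be sketched as one function. This version uses int64_t rather than long so that the size condition holds by construction; that substitution, like the name ordered_key, is an assumption of this sketch. It also assumes IEEE 754 binary64 doubles and maps NaN to 0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads the bit pattern of a positive double. */
static int64_t double_bits(double x)
{
    int64_t y;
    memcpy(&y, &x, sizeof y);
    return y;
}

/* Hypothetical helper: order-preserving map from double to int64_t.
   Assumes IEEE 754 binary64. Both zeros (and NaN) map to 0. */
static int64_t ordered_key(double x)
{
    if (x > 0)
        return double_bits(x);    /* positive doubles already sort by bit pattern */
    if (x < 0)
        return -double_bits(-x);  /* mirror negatives below zero */
    return 0;
}
```

For example, ordered_key(-2.0) < ordered_key(-1.0) < ordered_key(0.0) < ordered_key(1.0), and adjacent doubles such as 1.0 and 1.0000000000000002 still get distinct, correctly ordered keys.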

There are four basic transformations from floating-point to integer types:
floor - Rounds towards negative infinity, i.e. next lowest integer.
ceil[ing] - Rounds towards positive infinity, i.e. next highest integer.
trunc[ate] - Rounds towards zero, i.e. strips the floating-point portion and leaves the integer.
round - Rounds towards the nearest integer.
None of these transformations will give the behaviour you specify, but floor will permit the slightly weaker condition (a < b) implies (c <= d).
If a double value uses more space to represent than a long, then there is no mapping that can meet your initial constraint, thanks to the pigeonhole principle. Basically, since the double type can represent many more distinct values than a long type, there is no way to preserve the strict partial order of the < relationship, as multiple double values would be forced to map to the same long value.
See also:
Difference between Math.Floor() and Math.Truncate() (Stack Overflow)
Pigeonhole principle (Wikipedia)

Use frexp() to get you mostly there. It splits the number into exponent and significand (fraction).
Assume long is at least the same size as double; otherwise this is pointless (pigeonhole principle).
#include <assert.h>
#include <math.h>
long f(double x) {
assert(sizeof(long) >= sizeof(double));
#define EXPOWIDTH 11
#define FRACWIDTH 52
int ipart;
double fraction = frexp(fabs(x), &ipart);
long lg = ipart;
lg += (1L << EXPOWIDTH)/2;
if (lg < 0) lg = 0; /* clamp the biased exponent, not ipart */
if (lg >= (1L << EXPOWIDTH)) lg = (1L << EXPOWIDTH) - 1;
lg <<= FRACWIDTH;
lg += (long) (fraction * (1L << FRACWIDTH));
if (x < 0) {
lg = -lg;
}
return lg;
}
Notes:
The proper value for EXPOWIDTH depends on DBL_MAX_EXP and DBL_MIN_EXP and the particulars of the double type.
This solution maps some distinct double values near the extremes of double to the same long. I will look and test more later.
Otherwise, as commented above: overlay the two types.
As long is often 2's complement and double is laid out in a sign-magnitude fashion, extra work is needed when the double is negative. Also watch out for -0.0.
long f(double x) {
assert(sizeof x == sizeof (long));
union {
double d;
long lg;
} u = { x + 0.0 }; // + 0.0 turns -0.0 into +0.0 (x*1.0 would keep the sign)
// If 2's complement - which is the common situation
if (u.lg < 0) {
u.lg = LONG_MIN - u.lg; // negated magnitude; LONG_MAX - u.lg would overflow
}
return u.lg;
}

pow() seems to be out by one here

What's going on here:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %lf\n", pow(17, 12));
printf("17^13 = %lf\n", pow(17, 13));
printf("17^14 = %lf\n", pow(17, 14));
}
I get this output:
17^12 = 582622237229761.000000
17^13 = 9904578032905936.000000
17^14 = 168377826559400928.000000
13 and 14 do not match Wolfram Alpha; compare:
12: 582622237229761.000000
582622237229761
13: 9904578032905936.000000
9904578032905937
14: 168377826559400928.000000
168377826559400929
Moreover, it's not wrong by some strange fraction - it's wrong by exactly one!
If this is down to me reaching the limits of what pow() can do for me, is there an alternative that can calculate this? I need a function that can calculate x^y, where x^y is always less than ULLONG_MAX.

pow works with double numbers. These represent numbers of the form s * 2^e where s is a 53 bit integer. Therefore double can store all integers below 2^53, but only some integers above 2^53. In particular, it can only represent even numbers > 2^53, since for e > 0 the value is always a multiple of 2.
17^13 needs 54 bits to represent exactly, so e is set to 1 and hence the calculated value becomes an even number. The correct value is odd, so it's not surprising it's off by one. Likewise, 17^14 takes 58 bits to represent. That it too is off by one is a lucky coincidence (as long as you don't apply too much number theory): it just happens to be one off from a multiple of 32, which is the granularity at which double numbers of that magnitude are rounded.
For exact integer exponentiation, you should use integers all the way. Write your own double-free exponentiation routine. Use exponentiation by squaring if y can be large, but I assume it's always less than 64, making this issue moot.

The numbers you get are too big to be represented with a double accurately. A double-precision floating-point number has essentially 53 significant binary digits and can represent all integers up to 2^53 or 9,007,199,254,740,992.
For larger numbers, the lowest bits are lost and the result of your calculation is rounded to the nearest number that can be represented as a double. For 17^13, which is only slightly above the limit, this is the closest even number. For numbers greater than 2^54 this is the closest number that is divisible by four, and so on.
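The rounding described here can be observed by passing the exact integers through a double and back. This is a small sketch; through_double is a made-up name, and the results assume IEEE 754 binary64 with the default round-to-nearest conversion.

```c
#include <stdint.h>

/* Hypothetical helper: converts n to double and back, exposing any
   rounding that the double format introduces. */
static uint64_t through_double(uint64_t n)
{
    return (uint64_t)(double)n;
}
```

17^12 = 582622237229761 is below 2^53 and survives unchanged, while 17^13 = 9904578032905937 comes back as 9904578032905936 and 17^14 = 168377826559400929 comes back as 168377826559400928, matching the pow() output in the question.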

If your input arguments are non-negative integers, then you can implement your own pow (give it a different name if you also include <math.h>, which already declares pow).
Recursively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
if (y == 0)
return 1;
if (y == 1)
return x;
return pow(x,y/2)*pow(x,y-y/2);
}
Iteratively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y--)
res *= x;
return res;
}
Efficiently:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y > 0)
{
if (y & 1)
res *= x;
y >>= 1;
x *= x;
}
return res;
}
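None of the three versions above reports overflow, which matters when the result must stay below ULLONG_MAX. The following variant of the squaring loop adds that check. It is only a sketch: the name ipow_checked and its bool/out-parameter interface are assumptions of this example, not part of the answer above.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: computes base^exp by repeated squaring.
   Returns false (leaving *out untouched) if the result would not
   fit in unsigned long long. */
static bool ipow_checked(unsigned long long base, unsigned int exp,
                         unsigned long long *out)
{
    unsigned long long result = 1;
    while (exp > 0) {
        if (exp & 1u) {
            if (base != 0 && result > ULLONG_MAX / base)
                return false;              /* result * base would overflow */
            result *= base;
        }
        exp >>= 1;
        if (exp > 0) {                     /* square only if still needed */
            if (base != 0 && base > ULLONG_MAX / base)
                return false;              /* base * base would overflow */
            base *= base;
        }
    }
    *out = result;
    return true;
}
```

With this, ipow_checked(17, 13, &r) succeeds and stores the exact value 9904578032905937, while ipow_checked(2, 64, &r) returns false on a platform where unsigned long long is 64 bits wide.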

A small addition to the other good answers: on the x86 architecture the x87 80-bit extended format is usually available, and most C compilers support it via the long double type. This format can represent all integers up to 2^64 without gaps.
<math.h> has an analogue of pow() intended for long double numbers: powl(). Note also that the format specifier for long double values differs from the one for double: it is %Lf. So the correct program using the long double type looks like this:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %Lf\n", powl(17, 12));
printf("17^13 = %Lf\n", powl(17, 13));
printf("17^14 = %Lf\n", powl(17, 14));
}
As Stephen Canon noted in the comments, there is no guarantee that this program will give an exact result.
