Floating-point-to-integer conversion rounding up instead of truncating - c

I was surprised to find that a floating-point-to-integer conversion rounded up instead of truncating the fractional part. Here is some sample code, compiled using Clang, that reproduces that behavior:
double a = 1.12; // 1.1200000000000001 * 2^0
double b = 1024LL * 1024 * 1024 * 1024 * 1024; // 1 * 2^50
double c = a * b; // 1.1200000000000001 * 2^50
long long d = c; // 1261007895663739
Using exact math, the floating-point value represents
1.1200000000000001 * 2^50 = 1261007895663738.9925899906842624
I was expecting the resulting integer to be 1261007895663738 due to truncation but it is actually 1261007895663739. Why?

Assuming IEEE 754 double precision, 1.12 is exactly
1.12000000000000010658141036401502788066864013671875
Written in binary, its significand is exactly:
1.0001111010111000010100011110101110000101000111101100
Note the last two zeros are intentional, since it's what you get with double precision (1 bit before fraction separator, plus 52 fractional bits).
So, if you shift by 50 places, you'll get an integer value
100011110101110000101000111101011100001010001111011.00
or in decimal
1261007895663739
When converting to long long, no truncation or rounding occurs: the value is already an integer, so the conversion is exact.
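A minimal sketch of that check, assuming IEEE 754 binary64 for double. The helper names is_integral and scaled_value are illustrative and do not come from the original post. Multiplying by a power of two only changes the exponent, so the product carries no rounding error and its fractional part can be tested directly:

```c
#include <math.h>
#include <stdbool.h>

/* Returns true when x has no fractional part, i.e. converting it to an
   integer type cannot discard anything (assuming the value is in range). */
bool is_integral(double x) {
    double whole;
    return modf(x, &whole) == 0.0;
}

/* Reproduces the question's computation: 1.12 scaled by 2^50.
   Assumes IEEE 754 binary64; the multiplication by 2^50 is exact. */
double scaled_value(void) {
    double a = 1.12;
    double b = 1024.0 * 1024 * 1024 * 1024 * 1024; /* 2^50 */
    return a * b;
}
```

Here is_integral(scaled_value()) holds, which is why the cast to long long has nothing to truncate.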

Using exact math, the floating-point value represents ...
a is not exactly 1.12, because 0.12 is not a dyadic rational (a fraction whose denominator is a power of two).
// `a` not exactly 1.12
double a = 1.12; // 1.1200000000000001 * 2^0
Nearby double values:
1.11999999999999988... Next closest double
1.12 Code
1.12000000000000011... Closest double
1.12000000000000033...
Instead, let us print the values with enough precision to see what is actually stored.
#include <stdio.h>
#include <float.h>
#include <math.h> /* for modf */
int main() {
double a = 1.12; // 1.1200000000000001 * 2^0
double b = 1024LL * 1024 * 1024 * 1024 * 1024; // 1 * 2^50
int prec = DBL_DECIMAL_DIG;
printf("a %.*e\n", prec, a);
printf("b %.*e\n", prec, b);
double c = a * b;
double whole;
printf("c %.*e (r:%g)\n", prec, c, modf(c, &whole));
long long d = (long long) c;
printf("d %lld\n", d);
}
Output
a 1.12000000000000011e+00
b 1.12589990684262400e+15
c 1.26100789566373900e+15 (r:0)
d 1261007895663739

Related

Problem with float and int multiplication in C [duplicate]

I'm using the online compiler https://www.onlinegdb.com/ and in the following code when I multiply 2.1 with 100 the output becomes 209 instead of 210.
#include<stdio.h>
#include <stdint.h>
int main()
{
float x = 1.8;
x = x + 0.3;
int coefficient = 100;
printf("x: %2f\n", x);
uint16_t y = (uint16_t)(x * coefficient);
printf("y: %d\n", y);
return 0;
}
What am I doing wrong? And what should I do to obtain 210?
I tried several different type casts, but it still doesn't work.
The following assumes the compiler uses IEEE-754 binary32 and binary64 for float and double, which is overwhelmingly common.
float x = 1.8;
Since 1.8 is a double constant, the compiler converts 1.8 to the nearest double value, 1.8000000000000000444089209850062616169452667236328125. Then, to assign it to the float x, it converts that to the nearest float value, 1.7999999523162841796875.
x = x + 0.3;
The compiler converts 0.3 to the nearest double value, 0.299999999999999988897769753748434595763683319091796875. Then it adds x and that value using double arithmetic, which produces 2.09999995231628400205181605997495353221893310546875.
Then, to assign that to x, it converts it to the nearest float value, 2.099999904632568359375.
uint16_t y = (uint16_t)(x * coefficient);
Since x is float and coefficient is int, the compiler converts the coefficient to float and performs the multiplication using float arithmetic. This produces 209.9999847412109375.
Then the conversion to uint16_t truncates the number, producing 209.
One way to get 210 instead is to use uint16_t y = lroundf(x * coefficient);. (lroundf is declared in <math.h>.) However, to determine what the right way is, you should explain what these numbers are and why you are doing this arithmetic with them.
Floating-point numbers are not exact. When you add 1.8 + 0.3,
the FPU may produce a result slightly different from the expected 2.1 (by a margin smaller than the float epsilon).
Read more about floating-point representation on Wikipedia: https://en.wikipedia.org/wiki/Machine_epsilon
What happens in your case is:
(1.8 + 0.3) * 100 = 209.9999...
Then you truncate it to an integer, resulting in 209.
You might also find this question relevant: Why float.Epsilon and not zero? A fix might be:
#include<stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h> /* for round */
int main()
{
float x = 1.8;
x = x + 0.3;
uint16_t coefficient = 100;
printf("x: %2f\n", x);
uint16_t y = round(x * coefficient);
printf("y: %" PRIu16 "\n", y);
return 0;
}

Return a float value when using int parameters C

I have this issue where I need to print float number while I use int parameters in function.
float lift_a_car(const int stick_length, const int human_weight, const int car_weight) {
return (stick_length*human_weight)/(car_weight+human_weight);
}
I'm checking it by using:
printf("%.4f\n", lift_a_car(2, 80, 1400));
It only returns 0.0000.
The reason is that C evaluates each operator in the common type of its operands (the usual arithmetic conversions). Here every operand is an int, so the multiplication, the addition and the division are all performed in int, and int/int is integer division. The result is converted to float only at the very end, on return. To fix this, you have to convert at least one operand to float before the division takes place.
Your code return (stick_length*human_weight)/(car_weight+human_weight); equals the following operation:
int t1 = stick_length * human_weight; // 2 * 80 = 160
int t2 = car_weight + human_weight; // 1400 + 80 = 1480
int t3 = t1 / t2; // integer division: 160 / 1480 = 0
return (float) t3;
But what you want is to do this:
return ((float) stick_length*human_weight)/((float) car_weight+human_weight);
// or
return (float) (stick_length*human_weight)/(car_weight+human_weight);
This will be evaluated like:
float t1 = (float) stick_length * human_weight; // 2.0f * 80 = 160.0f
float t2 = (float) car_weight + human_weight; // 1400.0f + 80 = 1480.0f
float t3 = t1 / t2; // floating division: 160.0f / 1480.0f = 0.1081...
// or
int t1 = stick_length * human_weight; // 2 * 80 = 160
int t2 = car_weight + human_weight; // 1400 + 80 = 1480
float t3 = (float) t1 / t2; // floating division: 160.0f / 1480 = 0.1081...
The calculation
(stick_length*human_weight)/(car_weight+human_weight)
is an all integer calculation with an integer result. You should cast at least one of the variables or intermediate results to a floating point value.
Like for example
(float) (stick_length*human_weight)/(car_weight+human_weight)
That will convert the result of stick_length*human_weight into a float value, making the division a floating-point operation with a floating-point result.
Although the answer of our programming dude is correct, it looks quite dangerous:
Indeed, this is integer arithmetic:
(stick_length*human_weight)/(car_weight+human_weight)
Indeed, this is floating point arithmetic:
(float) (stick_length*human_weight)/(car_weight+human_weight)
But why? Simply because the cast operator has higher precedence than multiplication and division, so it is the same as:
((float) (stick_length*human_weight))/(car_weight+human_weight)
But beginners might not be aware of that and might start to do stuff like:
(float) ((stick_length*human_weight)/(car_weight+human_weight))
which will again give a bad result, because the integer division happens before the cast.
Therefore I would propose applying the cast as narrowly as possible, something like:
((float) stick_length*human_weight)/(car_weight+human_weight)
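The three placements of the cast can be compared side by side. This is only a sketch; the function names lift_int, lift_cast_late and lift_cast_early are made up for the comparison:

```c
/* All-integer arithmetic: 160 / 1480 is 0, converted to float only on return. */
float lift_int(int stick_length, int human_weight, int car_weight) {
    return (stick_length * human_weight) / (car_weight + human_weight);
}

/* Cast applied too late: the integer division has already produced 0. */
float lift_cast_late(int stick_length, int human_weight, int car_weight) {
    return (float)((stick_length * human_weight) / (car_weight + human_weight));
}

/* Cast applied to one operand first: the multiplication and the division
   are then both carried out in float. */
float lift_cast_early(int stick_length, int human_weight, int car_weight) {
    return ((float)stick_length * human_weight) / (car_weight + human_weight);
}
```

For the arguments (2, 80, 1400), the first two return 0 and the third returns roughly 0.1081.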
Type-casting while returning the calculated value should help you. Otherwise, you are returning 160/1480, which is 0 in integer arithmetic unless a cast is applied.
This code should help:
float lift_a_car(const int stick_length, const int human_weight, const int car_weight)
{
return ((float)(stick_length*human_weight))/((float)(car_weight+human_weight));
}
OP's code is doing integer division and so gets a truncated quotient.
For fun, a non-floating point, cast-less, integer approach. Useful in small processors where FP support is expensive.
void print_lift_a_car(int stick_length, int human_weight, int car_weight) {
// Use wider types like long or long long to avoid overflow.
// Scale integer math by 10000 as "%.4f" was desired.
long num = 10000L * stick_length * human_weight;
long den = 0L + car_weight + human_weight; // A long instead of an int addition.
long ratio = (num + den/2) / den; // Rounded division.
printf("%ld.%04ld\n", ratio/10000, ratio%10000);
}
May also want to handle den == 0 and ratio < 0 with additional code.
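One possible way to add that handling is sketched below. The names lift_scaled and print_scaled are hypothetical, and the sketch assumes the scaled product fits in long long. The sign is separated from the magnitude so that rounding and the "%04lld" fraction both operate on non-negative values:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns (stick_length*human_weight)/(car_weight+human_weight), scaled by
   10000 and rounded to nearest, using only integer math.
   Assumes 10000 * stick_length * human_weight fits in long long. */
long long lift_scaled(int stick_length, int human_weight, int car_weight) {
    long long num = 10000LL * stick_length * human_weight;
    long long den = 0LL + car_weight + human_weight;
    assert(den != 0);                    /* caller must rule out division by zero */
    int negative = (num < 0) != (den < 0);
    long long n = llabs(num), d = llabs(den);
    long long ratio = (n + d / 2) / d;   /* round half away from zero */
    return negative ? -ratio : ratio;
}

/* Prints a value scaled by 10000 with four decimals. The sign is printed
   separately so that values between -1 and 0 keep their minus sign. */
void print_scaled(long long scaled) {
    long long mag = llabs(scaled);
    printf("%s%lld.%04lld\n", scaled < 0 ? "-" : "", mag / 10000, mag % 10000);
}
```

For example, print_scaled(lift_scaled(2, 80, 1400)) prints 0.1081.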

Round to IEEE 754 precision but keep binary format

If I convert the decimal number 3120.0005 to float (32-bit) representation, the number gets rounded down to 3120.00048828125.
Assuming we're using a fixed point number with a scale of 10^12 then 1000000000000 = 1.0 and 3120000500000000 = 3120.0005.
What would the formula/algorithm be to round down to the nearest IEEE 754 precision to get 3120000488281250?
I would also need a way to get the result of rounding up (3120000732421875).
If you divide by the decimal scaling factor, you'll find your nearest representable float. For rounding in the other direction, nextafterf (std::nextafter in C++) can be used:
#include <float.h>
#include <math.h>
#include <stdio.h>
long long scale_to_fixed(float f)
{
float intf = truncf(f);
long long result = 1000000000000LL;
result *= (long long)intf;
result += round((f - intf) * 1.0e12);
return result;
}
/* not needed, always good enough to use (float)(n / 1.0e12) */
float scale_from_fixed(long long n)
{
float result = (n % 1000000000000LL) / 1.0e12;
result += n / 1000000000000LL;
return result;
}
int main()
{
long long x = 3120000500000000;
float x_reduced = scale_from_fixed(x);
long long y1 = scale_to_fixed(x_reduced);
long long yfloor = y1, yceil = y1;
if (y1 < x) {
yceil = scale_to_fixed(nextafterf(x_reduced, FLT_MAX));
}
else if (y1 > x) {
yfloor = scale_to_fixed(nextafterf(x_reduced, -FLT_MAX));
}
printf("%lld\n%lld\n%lld\n", yfloor, x, yceil);
}
Results:
3120000488281250
3120000500000000
3120000732421875
To handle the values as float scaled by 1e12 and compute the next larger representable value, e.g. "rounding up (3120000732421875)", the key is understanding that you are looking for the float that immediately follows the 32-bit representation of x / 1.0e12. While you can arrive at this value mathematically, a union between float and unsigned (or uint32_t) provides a direct way to interpret the stored 32-bit value of the floating-point number as an unsigned value (see footnote 1). For a positive finite float, adding 1 to that unsigned value gives the bit pattern of the next larger float.
A simple example using the union prev to hold the reduced value of x and a separate instance next holding the unsigned value (+1) can be:
#include <stdio.h>
#include <inttypes.h>
int main (void) {
uint64_t x = 3120000500000000;
union { /* union between float and uint32_t */
float f;
uint32_t u;
} prev = { .f = x / 1.0e12 }, /* x reduced to float, bit pattern readable as .u */
next = { .u = prev.u + 1u }; /* 2nd union, bit pattern incremented by 1 */
printf ("prev : %" PRIu64 "\n x : %" PRIu64 "\nnext : %" PRIu64 "\n",
(uint64_t)(prev.f * 1e12), x, (uint64_t)(next.f * 1e12));
}
Example Use/Output
$ ./bin/pwr2_prev_next
prev : 3120000488281250
x : 3120000500000000
next : 3120000732421875
Footnotes:
1. As an alternative, you can use a pointer to char to hold the address of the floating point type and interpret the 4-byte value stored at that location as unsigned without running afoul of C11 Standard - §6.5 Expressions (p6,7) (the "Strict Aliasing Rule"), but the use of a union is preferred.

Why is the function floor giving different results in this case?

In this example, the behaviour of floor differs and I do not understand why:
printf("floor(34000000.535 * 100 + 0.5) : %lf \n", floor(34000000.535 * 100 + 0.5));
printf("floor(33000000.535 * 100 + 0.5) : %lf \n", floor(33000000.535 * 100 + 0.5));
The output for this code is:
floor(34000000.535 * 100 + 0.5) : 3400000053.000000
floor(33000000.535 * 100 + 0.5) : 3300000054.000000
Why does the first result not equal to 3400000054.0 as we could expect?
double in C does not represent every possible number that can be expressed in text.
double can typically represent about 2^64 different numbers. Neither 34000000.535 nor 33000000.535 is in that set when double is encoded as a binary floating point number. Instead the closest representable number is used.
Text 34000000.535
closest double 34000000.534999996423...
Text 33000000.535
closest double 33000000.535000000149...
With double as a binary floating point number, multiplying by a non-power-of-2, like 100.0, can introduce additional rounding differences. In these cases the products land on opposite sides of xxx.5: one just below it and the other at or just above it.
Adding 0.5, a simple power of 2, does not incur rounding issues, as the value is not extreme compared to 3x00000053.5.
Printing the intermediate results with higher precision shows the step-by-step process.
#include <stdio.h>
#include <float.h>
#include <math.h>
void fma_test(double a, double b, double c) {
int n = DBL_DIG + 3;
printf("a b c %.*e %.*e %.*e\n", n, a, n, b, n, c);
printf("a*b %.*e\n", n, a*b);
printf("a*b+c %.*e\n", n, a*b+c);
printf("a*b+c %.*e\n", n, floor(a*b+c));
puts("");
}
int main(void) {
fma_test(34000000.535, 100, 0.5);
fma_test(33000000.535, 100, 0.5);
}
Output
a b c 3.400000053499999642e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b 3.400000053499999523e+09
a*b+c 3.400000053999999523e+09
a*b+c 3.400000053000000000e+09
a b c 3.300000053500000015e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b 3.300000053500000000e+09
a*b+c 3.300000054000000000e+09
a*b+c 3.300000054000000000e+09
The issue is more complex than this simple answer suggests, as various platforms can 1) use higher-precision math such as long double or 2) rarely, use a decimal floating-point double. So the code's results may vary.
This question has already been answered here.
Basically, floating-point numbers are just approximations. If we have a program like this:
float a = 0.2 + 0.3;
float b = 0.25 + 0.25;
if (a == b) {
//might happen
}
if (a != b) {
// also might happen
}
The only guaranteed thing is that a-b is relatively small.

How to compute sine wave with accuracy over the time

Use case is to generate a sine wave for digital synthesis, so, we need to compute all values of sin(d t) where:
t is an integer number, representing the sample number. This is variable. Range is from 0 to 158,760,000 for one hour sound of CD quality.
d is double, representing the delta of the angle. This is constant. And the range is: greater than 0 , less than pi.
Goal is to achieve high accuracy with traditional int and double data types. Performance is not important.
Naive implementation is:
double next()
{
t++;
return sin( ((double) t) * (d) );
}
But the problem is that as t increases, accuracy is reduced because large numbers are passed to the "sin" function.
An improved version is the following:
double next()
{
d_sum += d;
if (d_sum >= (M_PI*2)) d_sum -= (M_PI*2);
return sin(d_sum);
}
Here, I make sure to provide numbers in range from 0 to 2*pi to the "sin" function.
But now the problem is that when d is small, there are many small additions, each of which decreases the accuracy.
The question here is how to improve the accuracy.
Appendix 1
"accuracy gets reduced because big numbers provided to "sin" function":
#include <stdio.h>
#include <math.h>
#define TEST (300000006.7846112)
#define TEST_MOD (0.0463259891528704262050786960234519968548937998410258872449766)
#define SIN_TEST (0.0463094209176730795999323058165987662490610492247070175523420)
int main()
{
double a = sin(TEST);
double b = sin(TEST_MOD);
printf("a=%0.20f \n" , a);
printf("diff=%0.20f \n" , a - SIN_TEST);
printf("b=%0.20f \n" , b);
printf("diff=%0.20f \n" , b - SIN_TEST);
return 0;
}
Output:
a=0.04630944601888796475
diff=0.00000002510121488442
b=0.04630942091767308033
diff=0.00000000000000000000
You can try an approach that is used in some implementations of the fast Fourier transform. Values of the trigonometric functions are calculated based on previous values and the delta.
Sin(A + d) = Sin(A) * Cos(d) + Cos(A) * Sin(d)
Here we have to store and update cosine value too and store constant (for given delta) factors Cos(d) and Sin(d).
Now about precision: cosine(d) for small d is very close to 1, so there is risk of precision loss (there are only few significant digits in numbers like 0.99999987). To overcome this issue, we can store constant factors as
dc = Cos(d) - 1 = - 2 * Sin(d/2)^2
ds = Sin(d)
and use different formulas to update the current value
(here sa = Sin(A) for current value, ca = Cos(A) for current value)
ts = sa //remember last values
tc = ca
sa = sa * dc + ca * ds
ca = ca * dc - ts * ds
sa = sa + ts
ca = ca + tc
P.S. Some FFT implementations periodically (every K steps) renew sa and ca values through trig. functions to avoid error accumulation.
Example result. Calculations in doubles.
d=0.000125
800000000 iterations
finish angle 100000 radians
cos sin
described method -0.99936080743598 0.03574879796994
Cos,Sin(100000) -0.99936080743821 0.03574879797202
windows Calc -0.9993608074382124518911354141448
0.03574879797201650931647050069581
sin(x) = sin(x + 2N∙π), so the problem can be boiled down to accurately finding a small number which is equal to a large number x modulo 2π.
For example, –1.61059759 ≅ 256 mod 2π, and you can calculate sin(-1.61059759) with more precision than sin(256)
So let's choose some integer number to work with, 256. First find small numbers which are equal to powers of 256, modulo 2π:
// to be calculated once for a given frequency
// approximate hard-coded numbers for d = 1 below:
double modB = -1.61059759; // = 256 mod (2π / d)
double modC = 2.37724612; // = 256² mod (2π / d)
double modD = -0.89396887; // = 256³ mod (2π / d)
and then split your index as a number in base 256:
// split into a base 256 representation
// digits are named b0..b3 so they do not clash with the angle delta d
int b0 = i & 0xff;
int b1 = (i >> 8) & 0xff;
int b2 = (i >> 16) & 0xff;
int b3 = (i >> 24) & 0xff;
You can now find a much smaller number x which is equal to i modulo 2π/d:
// use our smaller constants instead of the powers of 256
double x = b0 + modB * b1 + modC * b2 + modD * b3;
double the_answer = sin(d * x);
For different values of d you'll have to calculate different values modB, modC and modD, which are equal to those powers of 256, but modulo (2π / d). You could use a high precision library for these couple of calculations.
Scale up the period to 2^64, and do the multiplication using integer arithmetic:
// constants:
double uint64Max = pow(2.0, 64.0);
double sinFactor = 2 * M_PI / (uint64Max);
// scale the period of the waveform up to 2^64
uint64_t multiplier = (uint64_t) floor(0.5 + uint64Max * d / (2.0 * M_PI));
// multiplication with index (implicitly modulo 2^64)
uint64_t x = i * multiplier;
// scale 2^64 down to 2π
double value = sin((double)x * sinFactor);
As long as your period is not billions of samples, the precision of multiplier will be good enough.
The following code keeps the input to the sin() function within a small range, while somewhat reducing the number of small additions or subtractions due to a potentially very tiny phase increment.
double next() {
t0 += 1.0;
d_sum = t0 * d;
if ( d_sum > 2.0 * M_PI ) {
t0 -= (( 2.0 * M_PI ) / d );
}
return (sin(d_sum));
}
For hyper accuracy, OP has 2 problems:
multiplying d by n and maintaining more precision than double. That is answered in the first part below.
Performing a mod of the period. The simple solution is to use degrees and then mod 360, which is easy to do exactly. Reducing large angles mod 2*π is tricky, as it needs a value of 2*π with about 27 more bits of accuracy than (double) 2.0 * M_PI.
Use 2 doubles to represent d.
Let us assume 32-bit int and binary64 double. So double has 53-bits of accuracy.
0 <= n <= 158,760,000, which is about 2^27.2. Since double can represent 53-bit unsigned integers contiguously and exactly, and 53-28 --> 25, any double with only 25 significant bits can be multiplied by n and still be exact.
Segment d into 2 doubles dmsb, dlsb: the 25 most significant bits and the 28 least.
int exp;
double dmsb = frexp(d, &exp); // exact result
dmsb = floor(dmsb * POW2_25); // exact result
dmsb /= POW2_25; // exact result
dmsb *= pow(2, exp); // exact result
double dlsb = d - dmsb; // exact result
Then each multiplication (or successive addition) of dmsb*n will be exact. (this is the important part.) dlsb*n will only error in its least few bits.
double next()
{
d_sum_msb += dmsb; // exact
d_sum_lsb += dlsb;
double angle = fmod(d_sum_msb, M_PI*2); // exact
angle += fmod(d_sum_lsb, M_PI*2);
return sin(angle);
}
Note: fmod(x,y) results are expected to be exact given exact x,y.
#include <stdio.h>
#include <math.h>
#define AS_n 158760000
double AS_d = 300000006.7846112 / AS_n;
double AS_d_sum_msb = 0.0;
double AS_d_sum_lsb = 0.0;
double AS_dmsb = 0.0;
double AS_dlsb = 0.0;
double next() {
AS_d_sum_msb += AS_dmsb; // exact
AS_d_sum_lsb += AS_dlsb;
double angle = fmod(AS_d_sum_msb, M_PI * 2); // exact
angle += fmod(AS_d_sum_lsb, M_PI * 2);
return sin(angle);
}
#define POW2_25 (1U << 25)
int main(void) {
int exp;
AS_dmsb = frexp(AS_d, &exp); // exact result
AS_dmsb = floor(AS_dmsb * POW2_25); // exact result
AS_dmsb /= POW2_25; // exact result
AS_dmsb *= pow(2, exp); // exact result
AS_dlsb = AS_d - AS_dmsb; // exact result
double y;
for (long i = 0; i < AS_n; i++)
y = next();
printf("%.20f\n", y);
}
Output
0.04630942695385031893
Use degrees
Recommend using degrees as 360 degrees is the exact period and M_PI*2 radians is an approximation. C cannot represent π exactly.
If OP still wants to use radians, for further insight on performing the mod of π, see Good to the Last Bit
