double/float conversion in C - c

I have this code
#define Third (1.0/3.0)
#define ThirdFloat (1.0f/3.0f)
int main()
{
double a=1/3;
double b=1.0/3.0;
double c=1.0f/3.0f;
printf("a = %20.15lf, b = %20.15lf, c = %20.15lf\n", a,b,c);
float d=1/3;
float e=1.0/3.0;
float f=1.0f/3.0f;
printf("d = %20.15f, e = %20.15f, f = %20.15f\n", d,e,f);
double g=Third*3.0;
double h=ThirdFloat*3.0;
float i=ThirdFloat*3.0f;
printf("(1/3)*3: g = %20.15lf; h = %20.15lf, i = %20.15f\n", g, h, i);
}
Which gives that output
a = 0.000000000000000, b = 0.333333333333333, c = 0.333333343267441
d = 0.000000000000000, e = 0.333333343267441, f = 0.333333343267441
(1/3)*3: g = 1.000000000000000; h = 1.000000029802322, i = 1.000000000000000
I assume the output for a and d looks like this because the compiler converts the integer result to floating point only after the integer division.
b looks good; e is wrong because of low float precision, and so are c and f.
But i have no idea why g has correct value (i thought that 1.0/3.0 = 1.0lf/3.0lf, but then i should be wrong) and why h isn't the same as i.

Let us first look closer: use "%.17e" (approximate decimal) and "%a" (exact).
#include <stdio.h>
#define Third (1.0/3.0)
#define ThirdFloat (1.0f/3.0f)
#define FMT "%.17e, %a"
int main(void) {
double a=1/3;
double b=1.0/3.0;
double c=1.0f/3.0f;
printf("a = " FMT "\n", a,a);
printf("b = " FMT "\n", b,b);
printf("c = " FMT "\n", c,c);
puts("");
float d=1/3;
float e=1.0/3.0;
float f=1.0f/3.0f;
printf("d = " FMT "\n", d,d);
printf("e = " FMT "\n", e,e);
printf("f = " FMT "\n", f,f);
puts("");
double g=Third*3.0;
double h=ThirdFloat*3.0;
float i=ThirdFloat*3.0f;
printf("g = " FMT "\n", g,g);
printf("h = " FMT "\n", h,h);
printf("i = " FMT "\n", i,i);
}
Output
a = 0.00000000000000000e+00, 0x0p+0
b = 3.33333333333333315e-01, 0x1.5555555555555p-2
c = 3.33333343267440796e-01, 0x1.555556p-2
d = 0.00000000000000000e+00, 0x0p+0
e = 3.33333343267440796e-01, 0x1.555556p-2
f = 3.33333343267440796e-01, 0x1.555556p-2
g = 1.00000000000000000e+00, 0x1p+0
h = 1.00000002980232239e+00, 0x1.0000008p+0
i = 1.00000000000000000e+00, 0x1p+0
But i have no idea why g has correct value
(1.0/3.0)*3.0 can be evaluated as a double at compile time or run time, and the rounded result is exactly 1.0.
(1.0/3.0)*3.0 can also be evaluated at compile time or run time using wider-than-double math, and the rounded result is again exactly 1.0. Research FLT_EVAL_METHOD.
and why h isn't the same as i.
(1.0f/3.0f) can use float math to form the float quotient, which is noticeably different from one-third: 0.333333343267.... A final *3.0 is, not surprisingly, different from 1.0.
The outputs are all correct. We need to see why the expectation was amiss.
OP further asks: "Why is h (float * double) less accurate than i (float * float)?"
Both start with 0.333333343267... * 3.0, not one-third * 3.0.
float * double is more accurate. Both form a product, yet float * float is a float product rounded to the nearest 1 part in 2^24, whereas the more accurate float * double product is a double and rounds to the nearest 1 part in 2^53. The float * float product rounds to 1.0000000, whereas float * double rounds to 1.0000000298...

But i have no idea why g has correct value (i thought that 1.0/3.0 = 1.0lf/3.0lf
G has exactly the value it should based on:
#define Third (1.0/3.0)
...
double g=Third*3.0;
which is g=(1.0/3.0)*3.0;
Which is 1.000000000000000 (when printed with "%20.15lf")

I think i got the answer.
#define Third (1.0/3.0)
#define ThirdFloat (1.0f/3.0f)
printf("%20.15f, %20.15lf\n", ThirdFloat*3.0, ThirdFloat*3.0);//float*double
printf("%20.15f, %20.15lf\n", ThirdFloat*3.0f, ThirdFloat*3.0f);//float*float
printf("%20.15f, %20.15lf\n", Third*3.0, Third*3.0);//double*double
printf("%20.15f, %20.15lf\n\n", Third*3.0f, Third*3.0f);//double*float
printf("%20.15f, %20.15lf\n", Third, Third);
printf("%20.15f, %20.15lf\n", ThirdFloat, ThirdFloat);
printf("%20.15f, %20.15lf\n", 3.0, 3.0);
printf("%20.15f, %20.15lf\n", 3.0f, 3.0f);
And output:
1.000000029802322, 1.000000029802322
1.000000000000000, 1.000000000000000
1.000000000000000, 1.000000000000000
1.000000000000000, 1.000000000000000
0.333333333333333, 0.333333333333333
0.333333343267441, 0.333333343267441
3.000000000000000, 3.000000000000000
3.000000000000000, 3.000000000000000
First line is not accurate because of the limitations of float. Constant ThirdFloat has really low precision, so when multiplied by double, compiler takes this really bad approximation (0.333333343267441), converts it into double and multiplies by 3.0 given by double, and that gives also wrong result (1.000000029802322).
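But if ThirdFloat, which is float, is multiplied by 3.0f, which is float as well, the product is rounded to float. The exact product, 1.000000029802322, lies between the neighbouring floats 1.0 and 1.000000119209290 and is closer to 1.0, so it rounds to exactly 1.0. The rounding error of the product happens to cancel the error in ThirdFloat; that's why I got the exact result.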

Related

Problem with float and int multiplication in C [duplicate]

I'm using the online compiler https://www.onlinegdb.com/ and in the following code when I multiply 2.1 with 100 the output becomes 209 instead of 210.
#include<stdio.h>
#include <stdint.h>
int main()
{
float x = 1.8;
x = x + 0.3;
int coefficient = 100;
printf("x: %2f\n", x);
uint16_t y = (uint16_t)(x * coefficient);
printf("y: %d\n", y);
return 0;
}
Where am I doing wrong? And what should I do to obtain 210?
I tried to all different type casts still doesn't work.
The following assumes the compiler uses IEEE-754 binary32 and binary64 for float and double, which is overwhelmingly common.
float x = 1.8;
Since 1.8 is a double constant, the compiler converts 1.8 to the nearest double value, 1.8000000000000000444089209850062616169452667236328125. Then, to assign it to the float x, it converts that to the nearest float value, 1.7999999523162841796875.
x = x + 0.3;
The compiler converts 0.3 to the nearest double value, 0.299999999999999988897769753748434595763683319091796875. Then it adds x and that value using double arithmetic, which produces 2.09999995231628400205181605997495353221893310546875.
Then, to assign that to x, it converts it to the nearest float value, 2.099999904632568359375.
uint16_t y = (uint16_t)(x * coefficient);
Since x is float and coefficient is int, the compiler converts the coefficient to float and performs the multiplication using float arithmetic. This produces 209.9999847412109375.
Then the conversion to uint16_t truncates the number, producing 209.
One way to get 210 instead is to use uint16_t y = lroundf(x * coefficient);. (lroundf is declared in <math.h>.) However, to determine what the right way is, you should explain what these numbers are and why you are doing this arithmetic with them.
Floating-point numbers are not exact. When you add 1.8 + 0.3,
the FPU may produce a result slightly different from the expected 2.1 (by a margin smaller than float epsilon).
Read more about floating-point number representation at https://en.wikipedia.org/wiki/Machine_epsilon
What happens in your case is:
(1.8 + 0.3) * 100 = 209.99999...
Then you truncate it to an integer, resulting in 209.
You might also find this question relevant: Why float.Epsilon and not zero? A fix might be:
#include<stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h>
int main()
{
float x = 1.8;
x = x + 0.3;
uint16_t coefficient = 100;
printf("x: %2f\n", x);
uint16_t y = round(x * coefficient);
printf("y: %" PRIu16 "\n", y);
return 0;
}

How do I write float max in float literal form and parse it?

Float max/min is
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368
Compiling to assembly I see the literal is 0xffefffffffffffff. I am unable to understand how to write it in a float literal form. I tried -0xFFFFFFFFFFFFFp972 which resulted in 0xFFEFFFFFFFFFFFFE. Notice the last digit is E instead of F. I have no idea why the last bit is wrong or why 972 gave me the closest number. I didn't understand what I should be doing with the exponent bias either. I used 13 F's because that would set 52bits (the amount of bits in the mantissa) but everything else I'm clueless on
I want to be able to write double min/max as a literal and be able to understand it enough so I can parse it into a 8byte hex value
How do I write float max as a float literal?
Use FLT_MAX. If making your own code, use exponential notation either as hex (preferred) or decimal. If in decimal, use FLT_DECIMAL_DIG significant digits. Any more is not informative. Append an f.
#include <float.h>
#include <stdio.h>
int main(void) {
printf("%a\n", FLT_MAX);
printf("%.*g\n", FLT_DECIMAL_DIG, FLT_MAX);
float m0 = FLT_MAX;
float m1 = 0x1.fffffep+127f;
float m2 = 3.40282347e+38f;
printf("%d %d\n", m1 == m0, m2 == m0);
}
Sample output
0x1.fffffep+127
3.40282347e+38
1 1
Likewise for double, yet no f.
printf("%a\n", DBL_MAX);
printf("%.*g\n", DBL_DECIMAL_DIG, DBL_MAX);
0x1.fffffffffffffp+1023
1.7976931348623157e+308
double m0 = DBL_MAX;
double m1 = 0x1.fffffffffffffp+1023;
double m2 = 1.7976931348623157e+308;
Rare machines will have different max values.

Floating-point-to-integer conversion rounding up instead of truncating

I was surprised to find that a floating-point-to-integer conversion rounded up instead of truncating the fractional part. Here is some sample code, compiled using Clang, that reproduces that behavior:
double a = 1.12; // 1.1200000000000001 * 2^0
double b = 1024LL * 1024 * 1024 * 1024 * 1024; // 1 * 2^50
double c = a * b; // 1.1200000000000001 * 2^50
long long d = c; // 1261007895663739
Using exact math, the floating-point value represents
1.1200000000000001 * 2^50 = 1261007895663738.9925899906842624
I was expecting the resulting integer to be 1261007895663738 due to truncation but it is actually 1261007895663739. Why?
Assuming IEEE 754 double precision, 1.12 is exactly
1.12000000000000010658141036401502788066864013671875
Written in binary, its significand is exactly:
1.0001111010111000010100011110101110000101000111101100
Note the last two zeros are intentional, since it's what you get with double precision (1 bit before fraction separator, plus 52 fractional bits).
So, if you shift by 50 places, you'll get an integer value
100011110101110000101000111101011100001010001111011.00
or in decimal
1261007895663739
When converting to long long, no truncation or rounding occurs; the conversion is exact.
Using exact math, the floating-point value represents ...
a is not exactly 1.12 as 0.12 is not dyadic.
// `a` not exactly 1.12
double a = 1.12; // 1.1200000000000001 * 2^0
Nearby double values:
1.11999999999999988... Next closest double
1.12 Code
1.12000000000000011... Closest double
1.12000000000000033...
Instead, let us look closer to truer values.
#include <stdio.h>
#include <float.h>
#include <math.h>
int main() {
double a = 1.12; // 1.1200000000000001 * 2^0
double b = 1024LL * 1024 * 1024 * 1024 * 1024; // 1 * 2^50
int prec = DBL_DECIMAL_DIG;
printf("a %.*e\n", prec, a);
printf("b %.*e\n", prec, b);
double c = a * b;
double whole;
printf("c %.*e (r:%g)\n", prec, c, modf(c, &whole));
long long d = (long long) c;
printf("d %lld\n", d);
}
Output
a 1.12000000000000011e+00
b 1.12589990684262400e+15
c 1.26100789566373900e+15 (r:0)
d 1261007895663739

Smallest number added to FLT_MAX to cause overflow

I need to find number which is a power of 2 that when added to FLT_MAX will cause overflow. However, when I printf very large power, like 2^300, inf still doesn't appear. Also, I thought that as FLT_MAX is the maximum floating point represented, adding 1 to it will cause overflow immediately.
#include <stdio.h>
#include <float.h>
int main(){
float f = FLT_MAX;
printf("%f", f + pow(2,300));
}
Any help would be appreciated. Thanks!
The answer is (FLT_MAX - nextafterf(FLT_MAX, 0))/2, that is, exactly 0x1p+103 or approximately 1.014120480e+31.
There is a mistake in the method you use to determine the answer: the standard function pow returns a double, and C's “usual arithmetic conversions” (C11 6.3.1.8:1) mean that the expression f + pow(2,300) is computed as a double. It is then printed as a double because of how arguments are passed to variadic functions.
This C program shows how you can arrive at the float value that, added to FLT_MAX with float addition, results in float infinity:
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(){
float f = FLT_MAX;
printf("FLT_MAX: %a\n", f);
float b = nextafterf(f, 0);
printf("number before FLT_MAX: %a\n", b);
float d = f - b;
printf("difference: %a\n", d);
printf("FLT_MAX + d: %a\n", f + d);
printf("FLT_MAX + d/2: %a\n", f + d/2);
printf("FLT_MAX + nextafterf(d/2,0): %a\n", f + nextafterf(d/2,0));
float answer = d/2;
printf("answer: %a %.9e\n", answer, answer);
}
It prints:
FLT_MAX: 0x1.fffffep+127
number before FLT_MAX: 0x1.fffffcp+127
difference: 0x1p+104
FLT_MAX + d: inf
FLT_MAX + d/2: inf
FLT_MAX + nextafterf(d/2,0): 0x1.fffffep+127
answer: 0x1p+103 1.014120480e+31
It shows that if you take the difference between FLT_MAX and its lower neighbor (call this difference d), as you could expect, d added to FLT_MAX produces inf. But this is not the smallest float you can add to FLT_MAX to produce inf; there are smaller candidates. It is enough to add exactly half of d to FLT_MAX in order for the result to round up to inf. If you add less than that, on the other hand, the result is rounded down to FLT_MAX.
This line is working with double not float.
printf("%f", f + pow(2,300));
To be working with float you need
printf("%f", f + powf(2,300));
and in this case the output is
inf
In the second case the float result is promoted to double in the call to printf, but by then it is too late: the value is already infinity. Note that 2^300 is itself too large for a float, so powf(2,300) already returns inf before the addition.
//float=(-1) ^ s * 2 ^ (x - 127) * (1 + n * 2 ^ -23)
// s xxxxxxxx nnnnnnnnnnnnnnnnnnnnnnn
//FLT_MAX 3.402823466e+38F 2 ^ 128 0 11111110 11111111111111111111111
//FLT_MIN 1.175494351e-38F 2 ^ -126 0 00000001 00000000000000000000000
//FLT_TRUE_MIN 1.401298464e-45F 2 ^ -149 0 00000000 00000000000000000000001
//ONE 1f 2 ^ 0 0 01111111 00000000000000000000000
//INFINITY - 2 ^ 128+ 0 11111111 00000000000000000000000
union
{
float f;
int i;
}k,k2,k3;
k.i = 0b01111111011111111111111111111111; // 2^128 FLT_MAX
k2.i = 0b01110011000000000000000000000000; // 2^103
k3.f = k.f + k2.f; // 2^128+ INFINITY

How do I get C to differentiate between single and double floats when evaluating trig functions?

I'm very new to C programming. I have to print the values of tan(0) to tan(pi/2) in steps of pi/20 for both single and double precision floats. However, when I use different data types to store the floats, nothing changes between single and double, and I expected the number of digits to change.
#include <stdio.h>
#include <math.h>
#define PI 3.1415926
int main()
{
float angle = 0.0;
float pi_single = 3.1415926;
printf("single precision:\n");
while(angle < pi_single/2){
float(tanangle) = 0.0;
tanangle = tan(angle);
printf("tan(%f) = %f\n", angle, tanangle);
angle = angle + pi_single/20;
}
double angle2 = 0.0;
double pi_double = 3.141592653589793;
printf("double precision:\n");
while(angle2 < pi_double/2 ){
double(tanangle2) = 0.0;
tanangle2 = tan(angle2);
printf("tan(%lf) = %lf\n", angle2, tanangle2);
angle2 = angle2 + pi_double/20;
}
return 0;
}
I'm trying to replicate the result of this Python program:
import numpy as np
theta_32 = np.arange(0, np.pi/2+np.pi/20, np.pi/20, dtype = 'float32')
print('single precision')
for theta in theta_32:
print(theta)
print(np.tan(theta))
print()
theta_64 = np.arange(0, np.pi/2+np.pi/20, np.pi/20, dtype = 'float64')
print('double precision')
for theta_new in theta_64:
print(theta_new)
print(np.tan(theta_new))
You need to use tanf if you want the computation to take place in float rather than double.
Note also that when working with floating point, it's best to use an integral type, int i, say to count from 0 to 19, then use pi * i / 20 for the angle. For further reading see Is floating point math broken?
Two main issues here:
First, the tan function takes a double and returns a double. Use tanf instead which takes a float and returns a float. So change this:
tanangle = tan(angle);
To this:
tanangle = tanf(angle);
Second, the %f format specifier by default prints 6 digits of precision. That's not enough to see the difference between single and double precision floating point. Expand the precision to, say, 15 digits and you'll see a difference. So then this:
printf("tan(%f) = %f\n", angle, tanangle);
Becomes:
printf("tan(%.15f) = %.15f\n", angle, tanangle);
And this:
printf("tan(%lf) = %lf\n", angle2, tanangle2);
Becomes:
printf("tan(%.15lf) = %.15lf\n", angle2, tanangle2);
I expected the number of digits to change
printf() is controlling the number of digits printed, not the type of the variable.
Use "%e" to see the floating point nature of a float rather than fixed with "%f".
To see all the useful digits, rather than print to the default precision of 6, use a precision modifier.
xxx_DECIMAL_DIG - 1 is the number of decimal digits to use with "%e" to see a total of xxx_DECIMAL_DIG significant digits. That is the number of digits needed to distinguish all values of that floating-point type.
#include <float.h>
// printf("tan(%f) = %f\n", angle, tanangle);
printf("tan(%.*e) = %.*e\n", FLT_DECIMAL_DIG - 1, angle, DBL_DECIMAL_DIG - 1, tanangle);
// printf("tan(%f) = %f\n", angle2, tanangle2);
printf("tan(%.*e) = %.*e\n", DBL_DECIMAL_DIG - 1, angle2, DBL_DECIMAL_DIG - 1, tanangle2);
Code could use tanf() to get the float result when the result is saved in a float.
// tanangle = tan(angle);
tanangle = tanf(angle);
printf("tan(%.*e) = %.*e\n", FLT_DECIMAL_DIG - 1, angle, FLT_DECIMAL_DIG - 1, tanangle);
Rather than approximate pi with fixed values, let the system calculate the best.
//float pi_single = 3.1415926;
//double pi_double = 3.141592653589793;
float pi_single = acosf(-1);
double pi_double = acos(-1);
Better to use an integer loop
for (int i=0; i<= 10; i++) { // 0 to pi/2 in steps of pi/20
float angle = i*pi_single/20;
float tanangle = tanf(angle);
printf("tan(%.*e) = %.*e\n",
FLT_DECIMAL_DIG - 1, angle,
FLT_DECIMAL_DIG - 1, tanangle);
}

Resources