How do float operations work in C? Big numbers give weird results

The following code gives some odd results:
#include <stdio.h>
#include <float.h>
int main(void)
{
    float t = 1.0;
    float res;
    float myFltMax = 340282346638528859.0;
    printf("FLT_MAX %f\n", FLT_MAX);
    res = FLT_MAX - t;
    printf("res %f\n", res);
    res = myFltMax - t;
    printf("res myFltMax %f\n", res);
    return 0;
}
The results are:
FLT_MAX 340282346638528859811704183484516925440.000000
res 340282346638528859811704183484516925440.000000
res myFltMax 340282356122255360.000000
So if I subtract 1 from FLT_MAX, the result is unchanged, and if I subtract 1 from another big float, the result is greater than the initial number.
I am using gcc version 4.7.2.
Thank you.

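Subtracting 1 from myFltMax does not give a result greater than the initial number; it gives the same number. Print myFltMax as well and you'll see that it is 340282356122255360, not 340282346638528859.
The compiler rounds your 340282346638528859 to the nearest value that can be represented in the float type, which happens to be 340282356122255360. A typical float has a 24-bit significand, so at this magnitude adjacent floats are 2^35 (about 3.4e10) apart, and subtracting 1 rounds straight back to the same value. The same happens with FLT_MAX, where the spacing is far larger still.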

Related

C - Subtraction of numbers with many decimals

Why does the C code below output "Difference: 0.000000" ? I need to make calculations with many decimals in one of my university tasks and I don't understand this because I'm new to programming in C. Am I using the correct type? Thanks in advance.
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <math.h>
int main() {
long double a = 1.00000001;
long double b = 1.00000000;
long double difference = a-b;
printf("Difference: %Lf", difference);
}
I have tried that code and I'm expecting to get the result: "Difference: 0.00000001"
You see 0.000000 because %Lf prints a fixed number of decimal places, and the default number is 6. In your case, the difference is 1 in the 8th decimal place, which shows as 0.000000 when printed to 6 d.p. Either use %Le or %Lg or specify more precision: %.8Lf.
#include <stdio.h>
int main(void)
{
long double a = 1.00000001;
long double b = 1.00000000;
long double difference = a - b;
printf("Difference: %Lf\n", difference);
printf("Difference: %.8Lf\n", difference);
printf("Difference: %Le\n", difference);
printf("Difference: %Lg\n", difference);
return 0;
}
Note the minimal set of headers.
Output:
Difference: 0.000000
Difference: 0.00000001
Difference: 1.000000e-08
Difference: 1e-08
#include <stdio.h>
int main() {
long double a = 1.000000001;
long double b = 1.000000000;
long double difference = a-b;
printf("Difference: %.9Lf\n", difference);
}
Try this code. You need to tell printf how many digits to print after the decimal point. Here the .9 prints 9 digits after the decimal point. You can adjust this value to your needs; just don't ask for more digits than the type's precision can meaningfully hold.

How do I write float max in float literal form and parse it?

Float max/min is
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368
Compiling to assembly, I see the literal is 0xffefffffffffffff. I don't understand how to write it in float literal form. I tried -0xFFFFFFFFFFFFFp972, which resulted in 0xFFEFFFFFFFFFFFFE. Notice the last digit is E instead of F. I have no idea why the last bit is wrong or why 972 gave me the closest number. I didn't understand what I should be doing with the exponent bias either. I used 13 F's because that would set 52 bits (the number of bits in the mantissa), but I'm clueless about everything else.
I want to be able to write double min/max as a literal and understand it well enough to parse it into an 8-byte hex value.
How do I write float max as a float literal?
Use FLT_MAX. If making your own code, use exponential notation either as hex (preferred) or decimal. If in decimal, use FLT_DECIMAL_DIG significant digits. Any more is not informative. Append an f.
#include <float.h>
#include <stdio.h>
int main(void) {
printf("%a\n", FLT_MAX);
printf("%.*g\n", FLT_DECIMAL_DIG, FLT_MAX);
float m0 = FLT_MAX;
float m1 = 0x1.fffffep+127f;
float m2 = 3.40282347e+38f;
printf("%d %d\n", m1 == m0, m2 == m0);
}
Sample output
0x1.fffffep+127
3.40282347e+38
1 1
Likewise for double, but without the f suffix.
printf("%a\n", DBL_MAX);
printf("%.*g\n", DBL_DECIMAL_DIG, DBL_MAX);
Sample output
0x1.fffffffffffffp+1023
1.7976931348623157e+308
The same three forms then work as initializers:
double m0 = DBL_MAX;
double m1 = 0x1.fffffffffffffp+1023;
double m2 = 1.7976931348623157e+308;
Rare machines will have different max values.

-2.6 to the power of 0.2 outputs #IND00

I'm currently practising C and have a question about the pow() function in C/C++.
#include <stdio.h>
#include <math.h>
int main(){
double k = 0.2;
printf("2.6^k = %f\n", pow(2.6, k));
printf("-2.6^k = %f\n", pow(-2.6, k));
}
OUTPUT:
2.6^k = 1.210583
-2.6^k = -1.#IND00
In this example, -2.6 to the power of 0.2 is a complex number (edit: I originally thought it was not), but the output says, as far as I can tell, that the number is indeterminate.
My exercise contains the following expression:
[image of the expression]
I implemented it like this:
/* e = 2.1783; x = -2.6 */
result = pow(cos(pow(x,0.2) - pow(e,-x + sqrt(3))) + 1.61,2);
But because (-x + sqrt(3)) is a negative number, it outputs:
-1.#IND00
The value 0.2 cannot be represented exactly in binary floating point. So what you have is not actually 0.2 but a value slightly more than that. This yields a complex result so pow returns NaN.
Reading into this further, section 7.12.7.4 of the C standard regarding the pow function states:
double pow(double x, double y);
A domain error occurs if x is finite and negative and y is finite and not an integer value.
In the event of a domain error, an implementation-defined value is returned. While MSVC doesn't seem to document what it does in this case, it apparently returns NaN. In the case of Linux, the man pages explicitly state that NaN is returned in this case.
With complex number math, (-2.6)^0.2 is 0.979382 + 0.711563*i.
pow(-2.6, k) does not have a real answer.
Alternative: use complex math:
#include <complex.h>
#include <math.h>
#include <stdio.h>
int main(void) {
double k = 0.2;
printf("2.6^k = %f\n", pow(2.6, k));
complex double y = cpow(-2.6, k);
printf("-2.6^k = %f %f*i\n", creal(y), cimag(y));
}
Output
2.6^k = 1.210583
-2.6^k = 0.979382 0.711563*i

Return zero and Inf values in C

I use C to do computation using the following code:
#include <stdio.h>
#include <math.h>
int main(void) {
    float x = 3.104924e-33;
    int i = 6000, j = 1089;
    float value, value_inv;
    value = sqrt(x / ((float)i * j));
    value_inv = 1. / value;
    printf("value = %e\n", value);
    printf("value_inv = %e\n", value_inv);
    return 0;
}
In fact, value should be about 2.18e-20, which is within the range of the float type in C. So why does the computer give me
value = 0.000000e+00
value_inv = inf
Does anybody know why it happens and how to solve this problem without changing data type to double?
OP's float apparently does not support sub-normals. C allows non-support.
Does anybody know why it happens and how to solve this problem without changing data type to double?
This may be an implementation detail or due to a compiler option. Without changing to double, look to a different compiler or different options: in particular, options concerning sub-normal support, the precision used for intermediate calculations, and optimization levels (which sometimes short-change edge cases like this).
On my machine, which does handle sub-normals and uses C11, FLT_TRUE_MIN (the smallest non-zero float) is smaller than FLT_MIN (the smallest normal non-zero float).
#include<float.h>
float xx = x/((float)i*j);
printf("xx = %e %e %e\n",xx, FLT_MIN, FLT_TRUE_MIN);
Output
xx = 4.751943e-40 1.175494e-38 1.401298e-45
In OP's case, without sub-normal support, xx became 0.0f and led to the undesired output.
Using double math will handle the small intermediate float values.
value = sqrt(x/(1.0*i*j)); // Form product with `double` math
value_inv = 1.0f/value; // Here we can just use float math
printf("value = %e\n",value);
printf("value_inv = %e\n",value_inv);
Output
value = 2.179897e-20
value_inv = 4.587373e+19
On my computer (Ryzen 2700X, x86_64) the results are:
value = 2.179897e-020
value_inv = 4.587373e+019
You can try 1.f instead of 1., which is actually a double:
value_inv = 1.f/value;
Apparently your system does not support that many digits for float. On my system the output is:
value = 2.179895e-020
value_inv = 4.587376e+019
I found the answer myself.
I should change sqrt(x/((float)i*j)) to sqrt((double)x/((double)i*j)). After this, I get the correct result:
value = 2.179897e-20
value_inv = 4.587373e+19
There is no reason to use float instead of double for such computations:
3.104924e-33 is a double constant, it gets converted to float upon assignment, with a potential loss of precision
sqrt gets a double argument and returns a double value. Implicit conversions occur again with potential loss of precision.
1. / value computes with the type double because 1. has this type. value gets converted before the division and the result is converted to float to store to value_inv.
value and value_inv are implicitly converted to double when passed to printf.
All these conversions may incur loss of precision or even truncation to 0.0. You should instead always use double unless there is a strong requirement to use float:
#include <stdio.h>
#include <math.h>
int main() {
double x = 3.104924e-33;
int i = 6000, j = 1089;
double value, value_inv;
value = sqrt(x / ((double)i * j));
value_inv = 1. / value;
printf("value = %e\n", value);
printf("value_inv = %e\n", value_inv);
return 0;
}
If for some reason you are required to use float, be careful to avoid unneeded conversions:
#include <stdio.h>
#include <math.h>
int main() {
float x = 3.104924e-33F;
int i = 6000, j = 1089;
float value, value_inv;
value = sqrtf(x / ((float)i * j));
value_inv = 1.F / value;
printf("value = %e\n", value);
printf("value_inv = %e\n", value_inv);
return 0;
}

C program not adding float correctly

I have a method that looks like this:
float * mutate(float* organism){
int i;
float sign = 1;
static float newOrg[INPUTS] = {0};
for (i = 0;i<INPUTS;i++){
if (rand() % 2 == 0) {
sign = 1;
} else {
sign = -1;
}
float temp = (organism[i] + sign);
printf("bf: %f af: %f diff: %f sign: %f sign2: %f temp: %f\n\n",
organism[i], (organism[i] + sign), (organism[i] + sign)-organism[i],
sign, sign+sign, temp);
newOrg[i] = organism[i] + sign;
}
return newOrg;
}
When sign is not 0 the first two "%f"s are the same and the 3rd is 0, also putting the sum in a variable didn't help. This is baffling me! I can post full code if needed.
Output:
bf: 117810016.000000 af: 117810016.000000 diff: 0.000000 sign: 1.000000 sign2: 2.000000 temp: 117810016.000000
Finite precision of float.
A typical float can only represent about 2^32 different numbers. 117,810,016.0 and 1.0 are two of them; 117,810,017.0 is not. So the C sum 117810016.0 + 1.0 yields the "best" representable answer, 117810016.0.
Using a higher-precision type like double will often extend the range of exact +1 math, but even that will not be exact with large enough values (typically about 9.0*10^15, or 2^53).
If the code is to keep using float, I suggest limiting organism[i] to the inclusive range of ±8,388,608.0 (2^23).
Perhaps the code can simply use an integer type such as long long for this task.
