Why floating-point numbers don't overflow to infinity

Let's say we have the following code:
float f = 999999999999*9999999999999999*999999999999; //a large number to make it overflow
According to the floating-point rules, the result should be infinity.
But when I checked the result's bit representation, it was not infinity but something else. How come?

None of the values in 999999999999*9999999999999999*999999999999 are floating-point values, so you aren't doing floating-point arithmetic. In fact, even without any extra warning options set, gcc gives me this warning for your code (clang gives a similar warning):
warning: integer overflow in expression of type 'long int' results in '4467987020393345025' [-Woverflow]
Do this instead:
float f = 999999999999.0f*9999999999999999*999999999999;
printf("%f\n", f);
output:
inf
Making the first literal a floating point number forces floating point arithmetic, so you get the infinite value you want.

Related

When does a float constant overflow if it is implicitly converted to int type

I have two code snippets, and they produce different results. I am using the TDM-GCC 4.9.2 compiler, 32-bit version
(size of int is 4 bytes and the minimum float value is -3.4e38).
Code 1:
int x;
x=2.999999999999999; // 15 '9s' after decimal point
printf("%d",x);
Output:
2
Code 2:
int x;
x=2.9999999999999999; // 16 '9s' after decimal point
printf("%d",x);
Output:
3
Why is the implicit conversion different in these cases?
Is it due to some overflow in the Real constant specified and if so how does it happen?

(Restricting this answer to IEEE754).
When a constant is converted to floating point, the IEEE754 standard requires the closest representable floating-point number to be picked. Neither of the numbers you present can be represented exactly.
The nearest IEEE754 double-precision floating-point number to 2.999999999999999 is 2.99999999999999911182158029987476766109466552734375, whereas the nearest one to 2.9999999999999999 is 3.
Hence the output: converting to an integral type truncates the value towards zero, giving 2 in the first case and 3 in the second.
Using round is one way to obviate this effect.
Further reading: Is floating point math broken?

Dodging the inaccuracy of a floating point number

I totally understand the problems associated with floating points, but I have seen a very interesting behavior that I can't explain.
float x = 1028.25478;
long int y = 102825478;
float z = y/(float)100000.0;
printf("x = %f ", x);
printf("z = %f",z);
The output is:
x = 1028.254761 z = 1028.254780
Now, if a float failed to represent that specific value (1028.25478) when I assigned it to variable x, why isn't it the same in the case of variable z?
P.S. I'm using pellesC IDE to test the code (C11 compiler).

I am pretty sure that what happens here is that the latter floating-point variable is elided and its value is instead kept in a double-precision register, then passed as-is as an argument to printf. The compiler believes it is safe to pass the number at double precision because the default argument promotions convert a float argument to double anyway.
I managed to produce a similar result using GCC 7.2.0, with these switches:
-Wall -Werror -ffast-math -m32 -funsafe-math-optimizations -fexcess-precision=fast -O3
The output is
x = 1028.254761 z = 1028.254800
The number is slightly different here.
The description for -fexcess-precision=fast says:
-fexcess-precision=style
This option allows further control over excess precision on
machines where floating-point operations occur in a format with
more precision or range than the IEEE standard and interchange
floating-point types. By default, -fexcess-precision=fast is in
effect; this means that operations may be carried out in a wider
precision than the types specified in the source if that would
result in faster code, and it is unpredictable when rounding to
the types specified in the source code takes place. When
compiling C, if -fexcess-precision=standard is specified then
excess precision follows the rules specified in ISO C99; in
particular, both casts and assignments cause values to be rounded
to their semantic types (whereas -ffloat-store only affects
assignments). This option [-fexcess-precision=standard] is enabled by default for C if a
strict conformance option such as -std=c99 is used. -ffast-math
enables -fexcess-precision=fast by default regardless of whether
a strict conformance option is used.
This behaviour isn't C11-compliant.

Restricting this to IEEE754 strict floating point, the answers should be the same.
1028.25478 is actually 1028.2547607421875. That accounts for x.
In the evaluation of y / (float)100000.0, y is converted to a float by C's usual arithmetic conversions. The closest float to 102825478 is 102825480. IEEE754 requires a division to return the best possible result, which here is 1028.2547607421875, the closest float to 1028.25480. That should be the value of z.
So my answer is at odds with your observed behaviour. I put that down to your compiler not implementing floating point strictly; or perhaps not implementing IEEE754.

The code acts as if z were a double and y/(float)100000.0 were y/100000.0.
float x = 1028.25478;
long int y = 102825478;
double z = y/100000.0;
// output
x = 1028.254761 z = 1028.254780
An important consideration is FLT_EVAL_METHOD. This allows select floating point code to evaluate at higher precision.
#include <float.h>
#include <stdio.h>
int main(void) {
    printf("FLT_EVAL_METHOD %d\n", FLT_EVAL_METHOD);
    return 0;
}
Except for assignment and cast ..., the values yielded by operators with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type. The use of evaluation formats is characterized by the implementation-defined value of FLT_EVAL_METHOD.
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the
type;
1 evaluate ... type float and double to the
range and precision of the double type, evaluate long double
... to the range and precision of the long double
type;
2 evaluate all ... to the range and precision of the
long double type.
Yet this does not apply here: with float z = y/(float)100000.0; the value should lose all extra precision on the assignment.
I agree with @Antti Haapala that the code is using a speed optimization that adheres less strictly to the expected rules of floating-point math.

value of variable in c language with 100 digits

So I'm new to C, and I have just learned about data types. What confuses me is that the value range of a double, for example, is from 2.3E-308 to 1.7E+308, and
mathematically a number of 100 digits ∈ [2.3E-308 , 1.7E+308].
Writing this simple program
#include <stdio.h>
int main()
{
double c = 5416751717547457918597197587615765157415671579185765176547645735175197857989185791857948797847984848;
printf("%le",c);
return 0;
}
The result is 7.531214e+18. After changing %le to %lf, the result is 7531214226330737664.000000,
which doesn't equal c.
So what is the problem?

This long number is actually a numerical literal of type long long. But since this type cannot contain such a long number, it is truncated modulo (LLONG_MAX + 1), resulting in 7531214226330737360.
Demo.
Edit:
@JohnBollinger: ... and then converted to double, with a resulting loss of a few (binary) digits of precision.
@rici: Demo2 - here the constant is of type double because of the added decimal point

It might seem that, if we can store a number of up to 10 to the power 308, we are storing 308 digits or so but, in floating point arithmetic, that isn't the case. Floating point numbers are not stored as huge strings of digits.
Broadly, a floating-point number is stored as a mantissa -- typically a number between zero and one -- and an exponent -- some number raised to the power of some other number. The different kinds of floating point number (float, double, long double) each has a different number of bits allocated to the mantissa and exponent. These bit counts, particularly in the mantissa, control the precision with which the number can be represented.
A double on most platforms gives 16-17 decimal digits of precision, regardless of the magnitude (power of ten). It's possible to use libraries that will do arithmetic to any degree of precision required, although such features are not built into C.
An additional complication is that, in your example, the number you assign to c is not actually defined to be a floating point number at all. Lacking any indication that it should be so represented, the compiler will treat it as an integer and, as it's too large to fit even the largest integer type on most platforms, it gets truncated down to integer range.

You should get a proper compiler or enable warnings on it. A recent GCC, with just default settings will output the following warning:
% gcc float.c
float.c: In function ‘main’:
float.c:4:12: warning: integer constant is too large for its type
double c = 5416751717547457918597197587615765157415671579185765176547645735175197857989185791857948797847984848;
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Notice that it says integer, i.e. a whole number, not floating point. In C a constant of that form denotes an integer. Unless suffixed with U, it is additionally a signed integer, with the first of the types int, long, and long long in which it fits. However, neither standard C nor common implementations have a type that is big enough to fit this value. So what happens is this (C11 6.4.4.1p6, http://port70.net/~nsz/c/c11/n1570.html#6.4.4.1p6):
If an integer constant cannot be represented by any type in its list and has no extended integer type, then the integer constant has no type.
Use of such an integer constant without type in arithmetic leads to undefined behaviour, that is the whole execution of the program is now meaningless. You should have read the warnings.
The "fix" would have been to add a . after the number!
#include <stdio.h>
int main(void)
{
double c = 54167517175474579185971975876157651574156715791\
85765176547645735175197857989185791857948797847984848.;
printf("%le\n",c);
}
And running it:
% ./a.out
5.416752e+99
Notice that even then, a double is precise to only about 15 significant decimal digits on average.

Float overflowed but not when a printf argument in C?

Why is it that when my overflow calculation is an argument of the printf() function, the float does not overflow, but when the calculation is assigned to a separate variable, float_overflowed, and is not an argument of the printf function, I get the expected result of 'inf'?
Why does this happen? What is causing this difference?
The code and results that led me to this question are below.
Here is my code that didn't execute as expected when the calculation is an argument:
float float_overflow;
float_overflow=3.4e38;
printf("This demonstrates floating data type overflow. We should get an \'inf\' value.\n%e*10=%e.\n\n",float_overflow, float_overflow*10); //No overflow?
The result:
This demonstrates floating data type overflow. We should get an 'inf' value.
3.400000e+38*10=3.400000e+39.
And, when the calculation is not an argument:
float float_upperlimit;
float float_overflowed;
float_upperlimit=3.4e38;
float_overflowed=float_upperlimit*10;
printf("This demonstrates floating data type overflow. We should get an \'inf\' value.\n%e*10=%e.\n\n",float_upperlimit, float_overflowed); //for float overflow
and its result:
This demonstrates floating data type overflow. We should get an 'inf' value.
3.400000e+38*10=inf.

Actually, the compiler is not constrained to do the arithmetic in float; it might well use double. Section 5.2.4.2.2 of the current C standard has:
Except for assignment and cast (which remove all extra range and
precision), the values yielded by operators with floating operands and
values subject to the usual arithmetic conversions and of floating
constants are evaluated to a format whose range and precision may be
greater than required by the type. The use of evaluation formats is
characterized by the implementation-defined value of FLT_EVAL_METHOD
So you only know that the value is forced to float when you assign it. Since any such argument in a printf call (a variadic function) is passed as a double anyhow, no conversion takes place when FLT_EVAL_METHOD has the value 1, that is, when all float arithmetic is done in double.

Remember that for the "%e" format (and all other floating-point formatting codes), the argument is actually a double. See e.g. the table in this reference.
That means that when you do the calculation "in-line" as the argument, you do not actually overflow. But when you assign it to the float variable, it has indeed overflowed, and that carries over when the variable is used in the printf call.

Problems casting NAN floats to int

Ignoring why I would want to do this, the IEEE 754 floating-point standard doesn't define the behavior for the following:
float h = NAN;
printf("%x %d\n", (int)h, (int)h);
Gives: 80000000 -2147483648
Basically, regardless of what value of NAN I give, it outputs 80000000 (hex) or -2147483648 (dec). Is there a reason for this and/or is this correct behavior? If so, how come?
The way I'm giving it different values of NaN is described here:
How can I manually set the bit value of a float that equates to NaN?
So basically, are there cases where the payload of the NaN affects the output of the cast?
Thanks!

The result of a cast of a floating point number to an integer is undefined/unspecified for values not in the range of the integer variable (±1 for truncation).
Clause 6.3.1.4:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
If the implementation defines __STDC_IEC_559__, then for conversions from a floating-point type to an integer type other than _Bool:
if the floating value is infinite or NaN or if the integral part of the floating value exceeds the range of the integer type, then the "invalid" floating-point exception is raised and the resulting value is unspecified.
(Annex F [normative], point 4.)
If the implementation doesn't define __STDC_IEC_559__, then all bets are off.

There is a reason for this behavior, but it is not something you should usually rely on.
As you note, IEEE-754 does not specify what happens when you convert a floating-point NaN to an integer, except that it should raise an invalid operation exception, which your compiler probably ignores. The C standard says the behavior is undefined, which means not only do you not know what integer result you will get, you do not know what your program will do at all; the standard allows the program to abort or get crazy results or do anything. You probably executed this program on an Intel processor, and your compiler probably did the conversion using one of the built-in instructions. Intel specifies instruction behavior very carefully, and the behavior for converting a floating-point NaN to a 32-bit integer is to return 0x80000000, regardless of the payload of the NaN, which is what you observed.
Because Intel specifies the instruction behavior, you can rely on it if you know the instruction used. However, since the compiler does not provide such guarantees to you, you cannot rely on this instruction being used.

First, a NAN is everything not considered a float number according to the IEEE standard.
So it can be several things. In the compiler I work with there is NAN and -NAN, so it's not about only one value.
Second, every compiler has its isnan set of functions to test for this case, so the programmer doesn't have to deal with the bits himself. To summarize, I don't think peeking at the value makes any difference. You might peek the value to see its IEEE construction, like sign, mantissa and exponent, but, again, each compiler gives its own functions (or better say, library) to deal with it.
I do have more to say about your testing, however.
float h = NAN;
printf("%x %d\n", (int)h, (int)h);
The cast you did truncates the float when converting it to an int. If you want to get the integer whose bits match the float's representation, do the following:
printf("%x %d\n", *(int *)&h, *(int *)&h);
That is, you take the address of the float, then refer to it as a pointer to int, and eventually take the int value. This way the bit representation is preserved.
