I am currently trying to debug an uninitialized-memory error, and the debugger has stopped at a line in the OpenBLAS source code containing the numeric literal 0.e0f. What does that mean?
The context is this:
if ((alpha_r == 0.e0f) && (alpha_i == 0.e0f)) return;
The 0.e0f evaluates to 0 apparently.
Floating-point literals have two syntaxes. The first one consists of the following parts:
nonempty sequence of decimal digits containing a decimal point character (defines significand)
(optional) e or E followed with optional minus or plus sign and nonempty sequence of decimal digits (defines exponent)
(optional) a suffix type specifier: one of l, f, L or F
The second one consists of the following parts:
nonempty sequence of decimal digits (defines significand)
e or E followed with optional minus or plus sign and nonempty sequence of decimal digits (defines exponent)
(optional) a suffix type specifier: one of l, f, L or F
The suffix type specifier defines the actual type of the floating-point literal:
(no suffix) defines double
f F defines float
l L defines long double
f is the float type indicator.
eX is the exponent: a factor of 10 to the power of X. For example, e5 means 10^5, which is 100000, and e-3 means 10^-3, which is 0.001.
Combining the two
1.23e-3f --> 1.23 x 10 ^ -3 = 0.00123
By expanding your example it is
0.e0f --> 0.0 x 10 ^ 0 = 0.0 (in floating point format)
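A minimal sketch (plain C99) confirming that these spellings all denote the same float value:

#include <stdio.h>

int main(void) {
    /* 0.e0f, 0.0f and 0e0f are three spellings of the same float zero */
    printf("%g %g %g\n", 0.e0f, 0.0f, 0e0f);   /* prints: 0 0 0 */
    /* 1.23e-3f is 1.23 x 10^-3 */
    printf("%g\n", 1.23e-3f);                  /* prints: 0.00123 */
    return 0;
}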
PS: A useful programming practice: never ever (see PS 2) compare two floating-point numbers for equality.
Some values cannot be represented exactly by floating-point types, only approximated.
Like this example
0.3 + 0.6 = 0.89999999999999991 != 0.9
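You can reproduce this with a minimal check (assuming IEEE-754 doubles; %.17g shows enough digits to expose the rounding):

#include <stdio.h>

int main(void) {
    double x = 0.3 + 0.6;
    printf("%.17g\n", x);        /* prints 0.89999999999999991 */
    printf("%d\n", x == 0.9);    /* prints 0: the equality test fails */
    return 0;
}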
Instead use:
#include <float.h>   /* FLT_EPSILON */
#include <math.h>    /* fabsf */

float a;
float b;
/* ... */
if (fabsf(a - b) < FLT_EPSILON)
FLT_EPSILON is a very small floating-point value such that 1.0f + FLT_EPSILON != 1.0f (about 1.19e-7 for IEEE-754 single precision). (Quoted from @AlterMann)
abs here stands for the absolute-value function for floating point; in the standard library it is fabs for double (fabsf for float, as used above). It can be found in other libraries under different names too.
PS 2: OK... "Never ever" was a little strong. I didn't mean that this piece of code is wrong, either as an algorithm or syntactically. The question itself is simple, but this warning may help new programmers who end up here.
As stated in the comments, this code example is appropriate for the check being made: comparing against 0 is a valid operation, because current floating-point representation standards guarantee that 0 can be expressed exactly.
It's zero in scientific notation, as single-precision floating point, with a redundant decimal mark. It's the same thing as 0e0.
That's a 0 of type float.
See 6.4.4.2 in the C99 Standard ( http://port70.net/~nsz/c/c99/n1256.html#6.4.4.2 )
In my opinion, a plain 0 would be better, irrespective of the types of alpha_r and alpha_i:
if ((alpha_r == 0) && (alpha_i == 0)) return;
Related
I recently took a lecture on systems programming, and my professor told me that f == (float)(double)f is wrong, which I cannot understand.
I know that a double loses data when converted to float, but I believe the loss happens only if the number stored in the double cannot be expressed as a float.
Shouldn't it be true, just as x == (int)(double)x is true?
I'm sorry that I didn't state my question clearly.
The question is not about declaration, but about conversion through the double type.
I hope you don't lose precious time because of my mistake.
Assuming IEC 60559, the result of f == (float)(double) f depends on the type of f.
Further assuming f is a float, then there's nothing "wrong" about the expression - it will evaluate to true (unless f held NaN, in which case the expression will evaluate to false).
On the other hand, x == (int)(double)x (assuming x is an int) is (potentially) problematic, since a double-precision IEC 60559 floating-point value has only 53 bits for the significand1, which cannot represent all possible values of an int if it uses more than 53 bits for its value on your platform (admittedly rare). So it will evaluate to true on platforms where int is 32 bits (using 31 bits for the value), and might evaluate to false on platforms where int is 64 bits (using 63 bits for the value), depending on the value.
Relevant quotes from the C standard (6.3.1.4 and 6.3.1.5) :
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged.
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
When a value of real floating type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged.
1 a double precision IEC 60559 floating point value consists of 1 bit for the sign, 11 bits for the exponent, and 53 bits for the significand (of which 1 is implied and not stored) - totaling 64 (stored) bits.
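A small sketch of both round trips; using int64_t to make the 64-bit integer case explicit, which goes slightly beyond the plain int of the question:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* 2^53 + 1 does not fit in a double's 53-bit significand,
       so the round trip through double loses the low bit */
    int64_t x = (INT64_C(1) << 53) + 1;
    printf("%d\n", x == (int64_t)(double)x);   /* prints 0 */

    /* every float is exactly representable as a double,
       so this round trip is lossless (NaN aside) */
    float f = 0.1f;
    printf("%d\n", f == (float)(double)f);     /* prints 1 */
    return 0;
}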
Taking the question as posed in the title literally,
Why is the statement “f == (float)(double)f;” wrong?
the statement is "wrong" not in any way related to the representation of floating point values but because it is trivially optimized away by any compiler and thus you might as well have saved the electrons used to store it. It is exactly equivalent to the statement
1;
or, if you like, to the statement (from the original question)
x == (int)(double)x;
(which has exactly the same effect as that in the title, regardless of the available precision of the types int, float, and double, i.e. none whatsoever).
Programming being somewhat concerned with precision, you should perhaps take note of the difference between a statement and an expression. An expression has a value, which might be true or false or something else, but when you add a semicolon (as you did in the question) it becomes a statement (as you called it in the question), and in the absence of side effects the compiler is free to throw it away.
NaNs are preserved through the float => double => float round trip, but they do not compare equal to themselves.
#include <math.h>
#include <stdio.h>

int main(void) {
    float f = HUGE_VALF;                     /* +infinity */
    printf("%d\n", f == (float)(double) f);  /* infinity survives the round trip */
    f = NAN;
    printf("%d\n", f == (float)(double) f);  /* NaN never compares equal */
    printf("%d\n", f == f);                  /* not even to itself */
}
Prints
1
0
0
In C, on an implementation with IEEE-754 floats, when I compare two floating-point numbers that are NaN, the comparison returns 0, i.e. "false". But why do two floating-point numbers that are both inf compare equal?
This program prints "equal: ..." (at least under Linux on AMD64 with gcc), and in my opinion it should print "different: ...".
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    volatile double a = 1e200; // use volatile to suppress compiler warnings
    volatile double b = 3e200;
    volatile double c = 1e200;

    double resA = a * c; // resA and resB should be inf
    double resB = b * c;

    if (resA == resB)
    {
        printf("equal: %e * %e = %e = %e = %e * %e\n", a, c, resA, resB, b, c);
    }
    else
    {
        printf("different: %e * %e = %e != %e = %e * %e\n", a, c, resA, resB, b, c);
    }
    return EXIT_SUCCESS;
}
Another example of why I think inf is not the same as inf: the set of natural numbers and the set of real numbers are both infinite, but they are not the same size.
So why is inf == inf?
Infinities compare equal because that's what the standard says. From section 5.11 Details of comparison predicates:
Infinite operands of the same sign shall compare equal.
inf==inf for the same reason that almost all floating point numbers compare equal to themselves: Because they're equal. They contain the same sign, exponent, and mantissa.
You might be thinking of how NaN != NaN. But that's a relatively unimportant consequence of a much more important invariant: NaN != x for any x. As the name implies, NaN is not any number at all, and hence cannot compare equal to anything, because the comparison in question is a numeric one (hence why -0 == +0).
It would certainly make some amount of sense to have inf compare unequal to other infs, since in a mathematical context they're almost certainly unequal. But keep in mind that floating point equality is not the same thing as absolute mathematical equality; 0.1f * 10.0f != 1.0f, and 1e100f + 1.0f == 1e100f. Just as floating point numbers gradually underflow into denormals without compromising as-good-as-possible equality, so they overflow into infinity without compromising as-good-as-possible equality.
If you want inf != inf, you can emulate it: 1e400 == 3e400 evaluates to true, but 1e400 - 3e400 == 0 evaluates to false, because the result of +inf + -inf is NaN. (Arguably you could say it should evaluate to 0, but that would serve nobody's interest.)
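A short sketch of that emulation, assuming IEEE-754 doubles:

#include <math.h>
#include <stdio.h>

int main(void) {
    double p = INFINITY;
    double q = INFINITY;
    printf("%d\n", p == q);          /* 1: same-signed infinities compare equal */
    printf("%d\n", p - q == 0.0);    /* 0: inf - inf is NaN, and NaN != anything */
    printf("%d\n", isnan(p - q));    /* 1 */
    return 0;
}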
Background
In C, according to the IEEE 754 binary floating-point standard (so, if you use a float or a double), you're going to get an exact value that can be compared exactly with another variable of the same type. Well, this is true unless your computations result in a value that lies outside the range that can be represented (i.e., overflow).
Why is Infinity == Infinity
resA and resB
The IEEE-754 standard defines infinity and negative infinity to compare greater than and less than, respectively, all other values representable in the standard (+INFINITY == 0 11111111111 0000000000000000000000000000000000000000000000000000 and -INFINITY == 1 11111111111 0000000000000000000000000000000000000000000000000000), except for NaN, which is neither less than, equal to, nor greater than any floating-point value (even itself). Note that infinity and its negative have explicit definitions in their sign, exponent, and mantissa bits.
So, resA and resB are infinity, and since infinity is explicitly defined and reproducible, resA == resB. I'm fairly certain this is how isinf() is implemented.
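For illustration only, here is a bit-level test in that spirit; this is an assumption about the technique, not how any particular libm actually implements isinf():

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* hypothetical helper: an IEEE-754 double is infinite exactly when its
   exponent bits are all 1 and its fraction bits are all 0 */
static int is_inf_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* well-defined way to inspect the bits */
    return (bits & UINT64_C(0x7FFFFFFFFFFFFFFF))
           == UINT64_C(0x7FF0000000000000);   /* sign bit masked off */
}

int main(void) {
    printf("%d %d %d\n", is_inf_bits(INFINITY), is_inf_bits(-INFINITY),
           is_inf_bits(NAN));   /* prints: 1 1 0 */
    return 0;
}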
Why is NaN != NaN
However, NaN is not one explicitly defined value. A NaN has an arbitrary sign bit, exponent bits of all 1s (just like infinity and its negative), and any non-zero set of fraction bits (Source). So how would you tell one NaN from another if their fraction bits are arbitrary anyway? The standard doesn't try: comparisons between floating-point values of this structure simply return false.
More Explanation
Because infinity is an explicitly defined value (Source, GNU C Manual):
Infinities propagate through calculations as one would expect
2 + ∞ = ∞
4 ÷ ∞ = 0
arctan (∞) = π/2.
However, NaN may propagate through computations quietly or signal. A QNaN (quiet NaN, most significant fraction bit set) propagates: all computations on it result in NaN. An SNaN (signaling NaN, most significant fraction bit clear) instead raises a floating-point exception when used.
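A quick sketch of quiet-NaN behavior (link with -lm if your toolchain requires it for sqrt):

#include <math.h>
#include <stdio.h>

int main(void) {
    double n = NAN;                /* NAN expands to a quiet NaN */
    printf("%f\n", 2.0 + n);       /* nan: quiet NaNs propagate */
    printf("%f\n", sqrt(-1.0));    /* nan: invalid operations produce NaN */
    printf("%d\n", n != n);        /* 1: NaN is unordered even with itself */
    return 0;
}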
There are many arithmetic systems. Some of them, including the ones normally covered in high school mathematics, such as the real numbers, do not have infinity as a number. Others have a single infinity, for example the projectively extended real line. Others, such as the IEEE floating point arithmetic under discussion, and the extended real line, have both positive and negative infinity.
IEEE754 arithmetic is different from real number arithmetic in many ways, but is a useful approximation for many purposes.
There is logic to the different treatment of NaNs and infinities. It is entirely reasonable to say that positive infinity is greater than negative infinity and any finite number. It would not be reasonable to say anything similar about the square root of -1.
I would like to understand at what point the format specifier %g for double starts printing values in exponential format.
myTest.c
#include <stdio.h>

int main() {
    double val = 384615.38462;
    double val2 = 9999999;

    printf("val = %g\n", val);
    printf("val2 = %g\n", val2);
    return 0;
}
Compiled using gcc :
gcc version 4.5.2 (GCC)
Target: i386-pc-solaris2.11
Output :
val = 384615
val2 = 1e+07
Question:
Why is val printed like an integer, while val2 has been converted to exponential format, even though I have not used %lf in printf?
Is there a range at which %g starts printing values in exponential format? If yes, is there a way to work out what that range is?
Thanks in Advance.
According to man 3 printf:
g, G
The double argument is converted in style f or e (or F or E for G conversions). The precision specifies the number of significant digits. If the precision is missing, 6 digits are given; if the precision is zero, it is treated as 1. Style e is used if the exponent from its conversion is less than -4 or greater than or equal to the precision. Trailing zeros are removed from the fractional part of the result; a decimal point appears only if it is followed by at least one digit.
And the C11 – ISO/IEC 9899:2011 standard draft N1570 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf):
g,G
A double argument representing a floating-point number is converted in
style f or e (or in style F or E in the case of a G conversion specifier),
depending on the value converted and the precision. Let P equal the
precision if nonzero, 6 if the precision is omitted, or 1 if the precision is zero.
Then, if a conversion with style E would have an exponent of X:
— if P > X ≥ −4, the conversion is with style f (or F) and precision
P − (X + 1).
— otherwise, the conversion is with style e (or E) and precision P − 1.
Finally, unless the # flag is used, any trailing zeros are removed from the
fractional portion of the result and the decimal-point character is removed if
there is no fractional portion remaining.
A double argument representing an infinity or NaN is converted in the style
of an f or F conversion specifier.
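Applying that rule with the default precision P = 6, a small sketch of where the switchover happens:

#include <stdio.h>

int main(void) {
    /* style e is used when the exponent X of the %e form
       satisfies X >= P or X < -4 (here P = 6) */
    printf("%g\n", 999999.0);    /* X = 5  -> f style: 999999 */
    printf("%g\n", 9999999.0);   /* X = 6  -> e style: 1e+07 (rounded to 6 digits) */
    printf("%g\n", 0.0001);      /* X = -4 -> f style: 0.0001 */
    printf("%g\n", 0.00001);     /* X = -5 -> e style: 1e-05 */
    return 0;
}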
Will the following code, with nothing in between the lines, always produce a value of true for the boolean b?
double d = 0.0;
bool b = (d == 0.0);
I'm using g++ version 4.8.1.
Assuming IEEE-754 (and probably most floating point representations), this is correct as 0.0 is representable exactly in all IEEE-754 formats.
Now if we take another literal that is not representable exactly in IEEE-754 binary formats, like 0.1:
double d = 0.1;
bool b = (d == 0.1);
This may result in a false value in b!
The implementation has the right to use, for example, double precision for d but a greater precision for the comparison with the literal.
(C99, 5.2.4.2.2p8) "Except for assignment and cast (which remove all extra range and precision), the values of operations with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type."
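You can check what your implementation does via FLT_EVAL_METHOD from <float.h>; a hedged sketch (the comparison prints 1 on typical SSE builds, but may print 0 where FLT_EVAL_METHOD is 2, e.g. 32-bit x87 code):

#include <float.h>
#include <stdio.h>

int main(void) {
    /* 0: evaluate to the type's own precision
       1: evaluate float and double as double
       2: evaluate everything as long double */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);

    double d = 0.1;
    printf("%d\n", d == 0.1);   /* may be 0 if the literal is held at greater precision */
    return 0;
}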
I'm trying to printf a __float128 using libquadmath, eg:
quadmath_snprintf(s, sizeof(s), "%.30Qg", f);
With the following three constraints:
The output must match the following production:
number = [ minus ] int [ frac ] [ exp ]
decimal-point = %x2E ; .
digit1-9 = %x31-39 ; 1-9
e = %x65 / %x45 ; e E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
int = zero / ( digit1-9 *DIGIT )
minus = %x2D ; -
plus = %x2B ; +
zero = %x30 ; 0
Given any input __float128 "i" that has been printfed to a string "s" matching the above production, and "s" is then scanfed back into a __float128 "j", "i" must be bitwise identical to "j" - i.e., no information should be lost. For at least some values this is not possible (NaN, infinity); what is the complete list of those values?
There should be no other string satisfying the above two criteria, that is shorter than the candidate.
Is there a quadmath_snprintf format string that satisfies the above (1, 3 and 2 when possible)? If so what is it?
What are the values of __float128 that cannot be represented accurately enough to satisfy point 2 by the above production (e.g. NaN, +/-infinity, etc.)? How do I detect if a __float128 holds one of these values?
If you're on x86, then the GCC __float128 type is a software implementation of the IEEE 754-2008 binary128 format. The IEEE 754 standard requires that a binary -> char -> binary roundtrip recovers the original value if the character representation contains 36 significant (decimal) digits. Thus the format string %.36Qg ought to do it.
It is not required that a NaN roundtrip recover the original bitwise value.
As for your requirement #3, libquadmath does not contain code for this kind of "shortest representation" formatting, e.g. in the spirit of the Steele & White paper or the code by David Gay.
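A minimal round-trip sketch under that assumption (GCC with libquadmath; link with -lquadmath):

#include <quadmath.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    __float128 i = strtoflt128("0.1", NULL);   /* 0.1 is not exactly representable */
    char s[64];

    /* 36 significant digits suffice to round-trip binary128 */
    quadmath_snprintf(s, sizeof s, "%.36Qg", i);
    __float128 j = strtoflt128(s, NULL);

    printf("%s\nround-trip ok: %d\n", s, memcmp(&i, &j, sizeof i) == 0);
    return 0;
}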
My intuition tells me that the binary fraction 0.1111...1 (128 ones), which equals 1 - 1/2**128, will produce the longest decimal expansion upon conversion to decimal. Convert that value to decimal (I don't have a bignum package right now), count the number of digits, add 2-3 on top of that, and you should be safe. I don't have a mathematical proof that this is enough, though.
If precision of I/O is important, I'd prefer outputting the float as a hex string. Accurate floating-point IO is hard to get right, and the library might be buggy in that respect.