C - Long double min and max value [duplicate]

I'm working with C, I have to do an exercise in which I have to print the value of long double min and long double max.
I used float.h as header, but these two macros (LDBL_MIN/MAX) give me the same value as if it was just a double.
I'm using Visual Studio 2015 and if I hover the mouse on LDBL_MIN it says #define LDBL_MIN DBL_MIN. Is that why it prints DBL_MIN instead of LDBL_MIN?
How can I fix this problem?
printf("Type: Long Double Value: %lf Min: %e Max: %e Memory:%lu\n",
val10, LDBL_MIN, LDBL_MAX, longd_size);
It is a problem because my assignment requires two different values for LDBL and DBL.

C does not specify that long double must have a greater precision/range than double.
Even if the implementation treats them as different types, they may have the same implementation, range, precision, min value, max value, etc.
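Concerning Visual Studio, the Microsoft documentation on long double (MS Long Double) helps: in that compiler, long double has the same representation as double, although it remains a distinct type.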
To fix the problem, use another compiler that supports long double with a greater precision/range than double. Perhaps GCC?

From this reference on the floating-point types:
long double - extended precision floating point type. Matches IEEE-754 extended floating-point type if supported, otherwise matches some non-standard extended floating-point type as long as its precision is better than double and range is at least as good as double, otherwise matches the type double. Some x86 and x86_64 implementations use the 80-bit x87 floating point type.
Added emphasis is mine.
What the above quote says is that while a compliant C compiler must have the long double type, it doesn't really have to support it differently than double, which is probably the case with the Visual Studio C compiler.

Those macros are either broken, or long double is just an alias for double on your system. To test, set a long double to DBL_MAX, multiply by two, then subtract DBL_MAX from it. If the result is finite, then you have extra exponent space in the long double. If not, and long double is bigger than double, the extra bytes could just be padding, or you could have the same exponent space and more precision. In that last case, LDBL_MAX's genuine value will be just a smidgen over DBL_MAX.
The easiest way to generate the max is simply to look up the binary representation. However, if you want to do it in portable C, you can probe it by repeated multiplications to get the magnitude, then fill out the mantissa by repeatedly adding descending powers of two until you run out of precision.

Related

How to use float.h macros to enhance the floating point precision

As I understood from this answer, there is a way to extend the precision using float.h via the macro LDBL_MANT_DIG. My goal is to enhance the floating point precision of double values so that I can store a more accurate number, e.g., 0.000000000566666 instead of 0.000000. Kindly, can someone give a short example of how to use this macro so that I can extend the precision stored in the buffer?
Your comment about wanting to store more accurate numbers so you don't get just 0.000000 suggests that the problem is not in the storage but in the way you're printing the numbers. Consider the following code:
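#include <stdio.h>
int main(void)
{
    float f = 0.000000000566666F;
    double d = 0.000000000566666;
    long double l = 0.000000000566666L;
    /* Each line: default %f, 16 decimal places, then scientific notation. */
    printf("%f %16.16f %13.6e\n", f, f, f);       /* float, promoted to double */
    printf("%f %16.16f %13.6e\n", d, d, d);       /* double */
    printf("%lf %16.16lf %13.6le\n", d, d, d);    /* double again; the l has no effect */
    printf("%Lf %16.16Lf %13.6Le\n", l, l, l);    /* long double needs L */
    return 0;
}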
When run, it produces:
0.000000 0.0000000005666660 5.666660e-10
0.000000 0.0000000005666660 5.666660e-10
0.000000 0.0000000005666660 5.666660e-10
0.000000 0.0000000005666660 5.666660e-10
As you can see, using the default "%f" format prints 6 decimal places, which makes the value appear as 0.000000. However, as the format with more precision shows, the value is stored correctly and can be displayed with more decimal places, or with the %e format, or indeed with the %g format, though the code doesn't show that in use; in this example %g would print much the same as %e, minus the trailing zero (5.66666e-10).
The %f conversion specification, as opposed to %lf or %Lf, says 'print a double'. Note that when float values are passed to printf(), they are automatically converted to double (just as numeric types shorter than int are promoted to int). Therefore, %f can be used for both float and double types, and indeed the %lf format (which was defined in C99 — everything else was defined in C90) can be used to format float or double values. The %Lf format expects a long double.
There isn't a way to store more precision in a float or double simply by using any of the macros from <float.h>. Those are more descriptions of the characteristics of the floating-point types and the way that they behave than anything else.
The answer you cited only mentions that the macro is equal to the number of precision digits that you can store. It cannot in any way increase precision. But the macro is for "long doubles", not doubles. You can use the long double type if you need more precision than the double type:
long double x = 3.14L;
Notice the "L" after the number for specifying a long double literal.
Floating-point types are implemented in hardware. The precision is standardized across the industry and baked into the circuits of the CPU. There's no way to increase it beyond long double except an extended-precision software library such as GMP.
The good news is that floating-point numbers don't get bogged down in leading zeroes. 0.000000000566666 won't round to zero. With only six digits, you only even need a single-precision float to represent it well.
There is an issue with math.h (not float.h), where the POSIX standard fails to provide π and e with long double precision. There are a couple workarounds: GNU defines e.g. M_PIl and M_El, or you can also use the preprocessor to paste an l onto such literal constants in another library (giving the number long double type) and hope for spare digits.

Is float better than double sometimes?

I was solving this problem on spoj http://www.spoj.com/problems/ATOMS/. I had to give the integral part of log(m / n) / log(k) as output. I had taken m, n, k as long long. When I was calculating it using long doubles, I was getting a wrong answer, but when I used float, it got accepted.
printf("%lld\n", (long long)(log(m / (long double)n) / log(k)));
This was giving a wrong answer but this:
printf("%lld\n", (long long)((float)log(m / (float)n) / (float)log(k)));
got accepted. So are there situations when float is better than double with respect to precision?
A float is never more accurate than a double since the former must be a subset of the latter, by the C standard:
6.2.5/6: "The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double."
Note that the standard does not insist on a particular floating point representation although IEEE754 is particularly common.
It might be better in some cases in terms of calculation time/space performance. One example that is just on the table in front of me: an ARM Cortex-M4F based microcontroller has a hardware Floating Point Unit (FPU) capable of single-precision arithmetic but not double precision, which gives an incredible boost to single-precision floating point calculations.
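Try this simple code:
#include <stdio.h>
int main(void)
{
    float i = 3.3;   /* 3.3 is a double constant, rounded to float here */
    if (i == 3.3)    /* i is converted back to double for the comparison */
        printf("Equal\n");
    else
        printf("Not Equal\n");
    return 0;
}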
Now try the same with double as a datatype of i.
double will always give you more precision than a float.
With double, you encode the number using 64 bits, while you're using only 32 bits with float.
Edit: As Jens mentioned it may not be the case. double will give more precision only if the compiler is using IEEE-754. That's the case of GCC, Clang and MSVC. I haven't yet encountered a compiler which didn't use 32 bits for floats and 64 bits for doubles though...

What is the range of floats as used in C in different environments?

The minimum range of the float datatype is 1E-37 to 1E+37. What is the maximum range of floats?
As for the largest possible maximum of the floating types, the standard doesn't specify it. What the standard specifies is "the minimum of the maximum".
C11 §5.2.4.2.2 Characteristics of floating types <float.h> Section 12 & 13
The values given in the following list shall be replaced by constant expressions with implementation-defined values that are greater than or equal to those shown:
— maximum representable finite floating-point number, $(1 - b^{-p})\,b^{e_{\max}}$
FLT_MAX 1E+37
DBL_MAX 1E+37
LDBL_MAX 1E+37
[...] values that are less than or equal to those shown:
— minimum normalized positive floating-point number, $b^{e_{\min}-1}$
FLT_MIN 1E-37
DBL_MIN 1E-37
LDBL_MIN 1E-37
The maximum range, and the range on all real-world implementations that matter, is -INFINITY to +INFINITY. One place where it actually matters that the "range of representable values" for float includes the infinities (on implementations that support infinities) is constant expressions: it is a constraint violation for a constant expression to be outside the range of values for its type, but even something like 1e9999999999999999999999999999 is within the range of values for IEEE single-precision float, since the range is -INFINITY to +INFINITY. There's a defect report/interpretation detailing this issue somewhere, but I don't have the link handy.
There is no general maximum range. The C standard only specifies which range has to be covered at least. A compiler can support any greater range.
There is however a standard for floating types, IEEE 754, specifying the behavior of floating point platforms in detail. This standard is usually applied.
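According to that standard, the values for single precision are about 1.4E-45 (the smallest positive subnormal; the smallest normal value is about 1.18E-38) and 3.4E38.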
There is no maximum range. An implementation is allowed to be arbitrarily generous. The only restrictions placed on floating point implementations are that:
The set of possible values of float is a subset of the set of possible values of double;
The set of possible values of double is a subset of the set of possible values of long double;
The set of all possible values of float (and thus double and long double) must include at least one finite number ≥ 1E37 and one finite number ≤ -1E37;
The set of all possible values of float (and thus double and long double) must include at least one positive number ≤ 1E-37 which is not zero;
The set of all possible values of float must include at least one positive number ≤ 1.0 + 1.0E-5 which is greater than 1.0; while the set of all possible values of double (and thus long double) must include at least one positive number ≤ 1.0 + 1.0E-9 which is greater than 1.0.
The last requirement does not strictly require double to be more precise than float since float could also include the same value.
However, an implementation may define the macro __STDC_IEC_559__. If it does, it needs to promise to implement the IEC-60559 standard (which is, in effect, IEEE-754); this commitment includes requiring float to be precisely the IEC-60559 single precision (32-bit) format, and double to be precisely the IEC-60559 double precision (64-bit) format. long double is not required to be an IEC-60559 format, but it must, as above, be a superset of double (or exactly the same type).

Calculations with long double in clang – Compiler bug?

Is this a bug in clang?
This prints out the maximum double value:
long double a = DBL_MAX;
printf("%Lf\n", a);
It is:
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
This prints out the maximum long double value:
long double a = LDBL_MAX;
printf("%Lf\n", a);
It is:
/* … bigger, but not displayed here. For a good reason. ;-) */
This is quite clear.
But when I use an arithmetic expression, that is compile time computable as an initializer, I get a surprising result:
long double a = 1.L + DBL_MAX + 1.L;
printf("%Lf\n", a);
This still prints out DBL_MAX and not DBL_MAX + 2!?
It is the same, if the computation is done at runtime:
long double b = 2.L;
long double a = DBL_MAX;
printf("%Lf\n", a+b);
Still DBL_MAX.
$ clang --version
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.4.0
Thread model: posix
Not a bug. long double in clang/x86_64 has 64 bits of precision, and results are rounded to fit in that format.
This will all be clearer if we use hex instead of decimal. DBL_MAX is:
0xfffffffffffff800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
The exact mathematical result of 1.L + DBL_MAX is therefore:
0xfffffffffffff800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
... but that is not representable as a long double, so the computed result is rounded to the closest representable long double, which is just DBL_MAX; adding 1 does not (and should not) change the value.
(It rounds down instead of up because the next larger representable number is
0xfffffffffffff801000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
which is much farther away from the mathematically precise result than DBL_MAX is).
The IEEE 754 double has a mantissa 53 bits wide (52 physical + 1 implicit bit). That means that double can accurately represent contiguous integers in the -2^53...+2^53 range (i.e. from -9007199254740992 to +9007199254740992). After that, the type can no longer represent contiguous integers precisely. Instead, between 2^53 and 2^54 it can represent only even integer values. Any odd value will be rounded to an adjacent even value according to the current rounding mode (round-to-nearest, ties-to-even by default). So, it is perfectly expected that adding 1 to 9007199254740992 within double might result in no change due to rounding. Starting from that limit you'll have to add at least 2 to see the change in the value (until you reach the point where adding 2 ceases to have any effect either and you'll have to add at least 4, and so on).
The same logic applies to long double, if it is larger than double on your platform. On x86 long double might refer to hardware 80-bit floating-point type with 64-bit mantissa. It means that even with that type your range for precise representation of contiguous integers is limited to a mere -2^64...+2^64.
The value of DBL_MAX is far, FAR, FAAAAR! outside that range. Which means that trying to add 1 to DBL_MAX will not have any effect on the value. Adding 2 will not have any effect either. Neither will 4, nor 1024, nor even 4294967296. You have to add something in 2^960 area (actually nextafter(2^959)) in order to make an impact on a DBL_MAX value stored in a 80-bit long double format.
This is expected behavior.
long double a = 1.L + DBL_MAX + 1.L;
The long double type is floating point: it has a finite amount of precision. The result of most operations is rounded to the nearest representable value.
See What Every Programmer Should Know About Floating-Point Arithmetic.
A not quite technically correct answer that hopefully helps:
The number is represented by a sign, an exponent, and a fraction.
On this page, information about the C data types is given (https://en.wikipedia.org/wiki/C_data_types). The chart claims that long double is not guaranteed to be a "larger" data type than double; however, since C99 this is guaranteed if it exists on the target architecture (Annex F IEC 60559 floating-point arithmetic). Your results from DBL_MAX and LDBL_MAX show that on your implementation it does in fact use more bits.
So here's what's happening:
you have a number in the following format:
<sign><exponent><fraction>
in double that would be
<1 bit><11 bits><52 bits>
in long double, you have this 80-bit representation (https://en.wikipedia.org/wiki/Extended_precision)
<1 bit><15 bits><64 bits>
You can fit the double type into the long double type, so this causes no problems. However, notice that the decimal point is "floating" (hence the name): not all digits in the number are represented. The computer represents the most significant digits, and then an exponent (so it would be like me writing 1234567 E 234, for example; notice that I'm not writing all 234 digits of that number). When you try to add 1 to this, the digit in the ones place is not being represented (due to the size of the exponent), so the addition is lost after rounding.
For more details, read up on floating point here (https://en.wikipedia.org/wiki/Double_precision_floating-point_format)
