Calculations with long double in clang

Calculations with long double in clang – Compiler bug? - c

Is this a bug in clang?
This prints out the maximum double value:
long double a = DBL_MAX;
printf("%Lf\n", a);
It is:
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
This prints out the maximum long double value:
long double a = LDBL_MAX;
printf("%Lf\n", a);
It is:
/* … bigger, but not displayed here. For a good reason. ;-) */
This is quite clear.
But when I use an arithmetic expression, that is compile time computable as an initializer, I get a surprising result:
long double a = 1.L + DBL_MAX + 1.L;
printf("%Lf\n", a);
This still prints out DBL_MAX and not DBL_MAX + 2!?
It is the same, if the computation is done at runtime:
long double b = 2.L;
long double a = DBL_MAX;
printf("%Lf\n", a+b);
Still DBL_MAX.
$ clang --version
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.4.0
Thread model: posix

Not a bug. long double in clang/x86_64 has 64 bits of precision, and results are rounded to fit in that format.
This will all be clearer if we use hex instead of binary. DBL_MAX is:
0xfffffffffffff800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
The exact mathematical result of 1.L + DBL_MAX is therefore:
0xfffffffffffff800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
... but that is not representable as a long double, so the computed result is rounded to the closest representable long double, which is just DBL_MAX; adding 1 does not (and should not) change the value.
(It rounds down instead of up because the next larger representable number is
0xfffffffffffff801000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
which is much farther away from the mathematically precise result than DBL_MAX is).

The IEE754 floating-point double has mantissa of 53 bits wide (52 physical + 1 implicit bit). That means that double can accurately represent contiguous integers in -2^53...+2^53 range (i.e. from -9007199254740992 to +9007199254740992). After that, the type can no longer represent contiguous integers precisely. Instead, the type can represent only even integer values. Any odd value will be rounded to an adjacent even value in accordance with some implementation-specific rules. So, it is perfectly expected that adding 1 to 9007199254740992 within double might result in nothing due to rounding. Starting from that limit you'll have to add at least 2 to see the change in the value (until you reach the point where adding 2 will cease to have any effect either and you'll have to add at least 4, and so on).
The same logic applies to long double, if it is larger than double on your platform. On x86 long double might refer to hardware 80-bit floating-point type with 64-bit mantissa. It means that even with that type your range for precise representation of contiguous integers is limited to a mere -2^64...+2^64.
The value of DBL_MAX is far, FAR, FAAAAR! outside that range. Which means that trying to add 1 to DBL_MAX will not have any effect on the value. Adding 2 will not have any effect either. Neither will 4, nor 1024, nor even 4294967296. You have to add something in 2^960 area (actually nextafter(2^959)) in order to make an impact on a DBL_MAX value stored in a 80-bit long double format.

This is expected behavior.
long double a = 1.L + DBL_MAX + 1.L;
The long double type is floating point: it has a finite amount of precision. The result of most operations is rounded to the nearest representable value.
See What Every Programmer Should Know About Floating-Point Arithmetic.

A not quite technically correct answer that hopefully helps:
The number is represented by a sign, an exponent, and a fraction.
On this page, information about the C data types is given (https://en.wikipedia.org/wiki/C_data_types). The chart claims that long double is not guaranteed to be a "larger" data type than double; however, since C99 this is guaranteed if it exists on the target architecture (Annex F IEC 60559 floating-point arithmetic). Your results from DBL_MAX and LDBL_MAX show that on your implementation it does in fact use more bits.
So here's what's happening:
you have a number in the following format:
in double that would be
<1 bit><11 bits><52 bits>
in long, you have this 80 bit representation (https://en.wikipedia.org/wiki/Extended_precision)
<1 bit><15 bits><64 bits>
You can fit the double type into the long double type so this causes no problems. However, notice that the decimal point is "floating" (hence the name) not all digits in the number are represented. The computer represents the most significant digits, and then and exponent (so it would be like me writing 1234567 E 234 for example, notice that I'm not writing all 234 digits of that number). When you try to add 1 to this, the digit in the one's place is not being represented (due to the size of the exponent), so this will be ignored after rounding.
For more details, read up on floating point here (https://en.wikipedia.org/wiki/Double_precision_floating-point_format)

Related

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation why I get two different
numbers, resp. 14 and 15, as an output from the following code?
#include <stdio.h>
int main()
{
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
printf("%d %d",b,c); // 14 15, why?
return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant but I was doing the test in CodeBlocks. However, if I type the same lines of code in some on-line compiler ( this one for example) I get an answer of 15 for the two printed variables.

... why I get two different numbers ...
Aside from the usual float-point issues, the computation paths to b and c are arrived in different ways. c is calculated by first saving the value as double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the
type;
1 evaluate operations and constants of type float and double to the
range and precision of the double type, evaluate long double
operations and constants to the range and precision of the long double
type;
2 evaluate all operations and constants to the range and precision of the
long double type.
C11dr §5.2.4.2.2 9
OP reported 2
By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One as 15 (or just above), and the other just under 15. int truncation amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
Conversion to int from a floating-point number is severe with numbers near a whole number. Often better to round() or lround(). The best solution is situation dependent.

This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:
10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
1.1000000000000000000000000000000000000000000000000000
/ .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.

The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.

An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.
If FLT_EVAL_METHOD==2:
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression then truncating it to a int, whereas for c it's evaluating from long double, truncating it to double and then to int.
So both values are not obtained with the same process, and this may lead to different results because floating types does not provides usual exact arithmetic.

value of variable in c language with 100 digits

So I'm new to c , and I have just learned about data type, what confuse me is that a value range of a double for example is from 2.3E-308 to 1.7E+308
mathematically a number of 100 digits ∈ [2.3E-308 , 1.7E+308].
Writing this simple program
#include <stdio.h>
int main()
{
double c = 5416751717547457918597197587615765157415671579185765176547645735175197857989185791857948797847984848;
printf("%le",c);
return 0;
}
the result is 7.531214e+18 by changing %le by %lf th result is 7531214226330737664.000000
which doesn't equal c.
So whats is the problem.

This long number is actually a numerical literal of type long long. But since this type cannot contain such a long number, it is truncated modulo (LLONG_MAX + 1) and resulting in 7531214226330737360.
Demo.
Edit:
#JohnBollinger: ... and then converted to double, with a resulting loss of a few (binary) digits of precision.
#rici: Demo2 - here the constant is of type double because of added decimal point

It might seem that, if we can store a number of up to 10 to the power 308, we are storing 308 digits or so but, in floating point arithmetic, that isn't the case. Floating point numbers are not stored as huge strings of digits.
Broadly, a floating-point number is stored as a mantissa -- typically a number between zero and one -- and an exponent -- some number raised to the power of some other number. The different kinds of floating point number (float, double, long double) each has a different number of bits allocated to the mantissa and exponent. These bit counts, particularly in the mantissa, control the precision with which the number can be represented.
A double on most platforms gives 16-17 decimal digits of precision, regardless of the magnitude (power of ten). It's possible to use libraries that will do arithmetic to any degree of precision required, although such features are not built into C.
An additional complication is that, in your example, the number you assign to c is not actually defined to be a floating point number at all. Lacking any indication that it should be so represented, the compiler will treat it as an integer and, as it's too large to fit even the largest integer type on most platforms, it gets truncated down to integer range.

You should get a proper compiler or enable warnings on it. A recent GCC, with just default settings will output the following warning:
% gcc float.c
float.c: In function ‘main’:
float.c:4:12: warning: integer constant is too large for its type
double c = 5416751717547457918597197587615765157415671579185765176547645735175197857989185791857948797847984848;
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Notice that it says integer, i.e. a whole number, not floating point. In C a constant of that form denotes an integer. Unless suffixed with U, it is additionally a signed integer, of the greatest type that it fits. However, neither standard C, nor common implementations, have a type that is big enough to fit this value. So what happens, is [(C11 6.4.4.1p6)[http://port70.net/~nsz/c/c11/n1570.html#6.4.4.1p6]) :
If an integer constant cannot be represented by any type in its list and has no extended integer type, then the integer constant has no type.
Use of such an integer constant without type in arithmetic leads to undefined behaviour, that is the whole execution of the program is now meaningless. You should have read the warnings.
The "fix" would have been to add a . after the number!
#include <stdio.h>
int main(void)
{
double c = 54167517175474579185971975876157651574156715791\
85765176547645735175197857989185791857948797847984848.;
printf("%le\n",c);
}
And running it:
% ./a.out
5.416752e+99
Notice that even then, a double is precise to average ~15 significant decimal digits only.

Precision loss / rounding difference when directly assigning double result to an int

Is there a reason why converting from a double to an int performs as expected in this case:
double value = 45.33;
double multResult = (double) value*100.0; // assign to double
int convert = multResult; // assign to int
printf("convert = %d\n", convert); // prints 4533 as expected
But not in this case:
double value = 45.33;
int multResultInt = (double) value*100.0; // assign directly to int
printf("multResultInt = %d\n", multResultInt); // prints 4532??
It seems to me there should be no difference. In the second case the result is still first stored as a double before being converted to an int unless I am not understanding some difference between casts and hard assignments.

There is indeed no difference between the two, but compilers are used to take some freedom when it comes down to floating point computations. For example compilers are free to use higher precision for intermediate results of computations but higher still means different so the results may vary.
Some compilers provide switches to always drop extra precision and convert all intermediate results to the prescribed floating point numbers (say 64bit double-precision numbers). This will make the code slower, however.
In the specific the number 45.33 cannot be represented exactly with a floating point value (it's a periodic number when expressed in binary and it would require an infinite number of bits). When multiplying by 100 this value may be you don't get an integer, but something very close (just below or just above).
int conversion or cast is performed using truncation and something very close to 4533 but below will become 4532, when above will become 4533; even if the difference is incredibly tiny, say 1E-300.
To avoid having problems be sure to account for numeric accuracy problems. If you are doing a computation that depends on exact values of floating point numbers then you're using the wrong tool.

#6502 has given you the theory, here's how to look at things experimentally
double v = 45.33;
int x = v * 100.0;
printf("x=%d v=%.20lf v100=%.20lf\n", x, v, v * 100.0 );
On my machine, this prints
x=4533 v=45.32999999999999829470 v100=4533.00000000000000000000
The value 45.33 does not have an exact representation when encoded as a 64-bit IEEE-754 floating point number. The actual value of v is slightly lower than the intended value due to the limited precision of the encoding.
So why does multiplying by 100.0 fix the problem on some machines? One possibility is that the multiplication is done with 80-bits of precision and then rounded to fit into a 64-bit result. The 80-bit number 4532.999... will round to 4533 when converted to 64-bits.
On your machine, the multiplication is evidently done with 64-bits of precision, and I would expect that v100 will print as 4532.999....

Why is my float division off by 0.00390625?

float a=67107842,b=512;
float c=a/b;
printf("%lf\n",c);
Why is c 131070.000000 instead of the correct value 131070.00390625?

Your compiler's float type is probably using the 32-bit IEEE 754 single-precision format.
67107842 is a 26-bit binary number:
11111111111111110000000010
The single-precision format represents most numbers as 1.x multipled by some (positive or negative) power of two, where 23 bits are stored after the binary place, with the leading 1. being implied (very small numbers are an exception).
But 67107842 would require 24 bits after the binary place (to be represented as 1.111111111111111000000001 multipled by 225). As there is only room to store 23 bits, the final 1 gets lost. So it is the value in a that is wrong in this case, not the division - a actually contains 67107840 (11111111111111110000000000), which is exactly 131070 * 512.
You can see this if you print a as well:
printf("%lf %lf %lf\n", a, b, c);
gives
67107840.000000 512.000000 131070.000000

Try changing a and c to be type "double", rather than float. That will give you better precision / accuracy. (Floats have about 6 or so significant digits; doubles have more than twice that.)

A float typically uses 32bit IEEE-754 single precision representation, and is good for only approximately 6 significant decimal figures. A double is good for 15, and where supported an 80 bit long double gets to 20 significant figures.
Note that on some compilers there is no distinction between double and long double, or even no support for long double at all.
One solution is to use an arbitrary-precision numeric library, or to use a decimal-floating point library rather then the built-in binary floating point support. Decimal floating point is not intrinsically more precise (though often such libraries support larger, more precise types), but will not show up the artefacts that occur when displaying a decimal representation of a binary floating point value. Decimal floating point is also likely to be much slower, since it is not typically implemented in hardware.

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work?
double d = floor(3.0 + 0.5);
int x = (int) d;
assert(x == 3);
My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like 2.99999, and x ends up being 2.
For the answer to this question to be yes, all integers within the range of an int have to be exactly representable as doubles, and floor must always return that exactly represented value.

All integers can have exact floating point representation if your floating point type supports the required mantissa bits. Since double uses 53 bits for mantissa, it can store all 32-bit ints exactly. After all, you could just set the value as mantissa with zero exponent.

If the result of floor() isn't exactly representable, what do you expect the value of d to be? Surely if you've got the representation of a floating point number in a variable, then by definition it's exactly representable isn't it? You've got the representation in d...
(In addition, Mehrdad's answer is correct for 32 bit ints. In a compiler with a 64 bit double and a 64 bit int, you've got more problems of course...)
EDIT: Perhaps you meant "the theoretical result of floor(), i.e. the largest integer value less than or equal to the argument, may not be representable as an int". That's certainly true. Simple way of showing this for a system where int is 32 bits:
int max = 0x7fffffff;
double number = max;
number += 10.0;
double f = floor(number);
int oops = (int) f;
I can't remember offhand what C does when conversions from floating point to integer overflow... but it's going to happen here.
EDIT: There are other interesting situations to consider too. Here's some C# code and results - I'd imagine at least similar things would happen in C. In C#, double is defined to be 64 bits and so is long.
using System;
class Test
{
static void Main()
{
FloorSameInteger(long.MaxValue/2);
FloorSameInteger(long.MaxValue-2);
}
static void FloorSameInteger(long original)
{
double convertedToDouble = original;
double flooredToDouble = Math.Floor(convertedToDouble);
long flooredToLong = (long) flooredToDouble;
Console.WriteLine("Original value: {0}", original);
Console.WriteLine("Converted to double: {0}", convertedToDouble);
Console.WriteLine("Floored (as double): {0}", flooredToDouble);
Console.WriteLine("Converted back to long: {0}", flooredToLong);
Console.WriteLine();
}
}
Results:
Original value: 4611686018427387903
Converted to double:
4.61168601842739E+18Floored (as double): 4.61168601842739E+18
Converted back to long:
4611686018427387904
Original value: 9223372036854775805
Converted to double:
9.22337203685478E+18Floored (as double): 9.22337203685478E+18
Converted back to long:
-9223372036854775808
In other words:
(long) floor((double) original)
isn't always the same as original. This shouldn't come as any surprise - there are more long values than doubles (given the NaN values) and plenty of doubles aren't integers, so we can't expect every long to be exactly representable. However, all 32 bit integers are representable as doubles.

I think you're a bit confused about what you want to ask. floor(3 + 0.5) is not a very good example, because 3, 0.5, and their sum are all exactly representable in any real-world floating point format. floor(0.1 + 0.9) would be a better example, and the real question here is not whether the result of floor is exactly representable, but whether inexactness of the numbers prior to calling floor will result in a return value different from what you would expect, had all numbers been exact. In this case, I believe the answer is yes, but it depends a lot on your particular numbers.
I invite others to criticize this approach if it's bad, but one possible workaround might be to multiply your number by (1.0+0x1p-52) or something similar prior to calling floor (perhaps using nextafter would be better). This could compensate for cases where an error in the last binary place of the number causes it to fall just below rather than exactly on an integer value, but it will not account for errors which have accumulated over a number of operations. If you need that level of numeric stability/exactness, you need to either do some deep analysis or use an arbitrary-precision or exact-math library which can handle your numbers correctly.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight