Weird behaviour in C arithmetic - c

I have this code
#include <stdio.h>
#include <math.h>
static double const x = 665857;
static double const y = 470832;
int main(){
double z = x*x*x*x -y*y*y*y*4 - y*y*4;
printf("%f \n",z);
return 0;
}
The real solution of this equation is 1. As already answered in a previous question of mine, this code fails because of catastrophic cancellation. However, now I've found an even stranger thing. It works if you use long longs, while, as far as I know, they have less range than doubles. Why?

long long has less range, but more precision than double.
However, that's not what's at work here. Your computation actually exceeds the range of long long as well, but because of the way in which integer overflow is handled on your system, the correct result falls out anyway. (Note that the behavior of signed integer overflow is not pinned down by the C standard, but "usually" behaves as you see here).
If you look instead at the intermediate result x*x*x*x, you will see that if you compute it using double, it has a sensible value; not exact, but rounded and good enough for most purposes. However, if you compute it in long long, you will find a number that appears at first to be absolutely bonkers, due to overflow.
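For illustration, a small sketch that prints that intermediate value both ways (the double result is merely rounded; the long long result relies on overflow behaviour the standard leaves undefined, so what you actually see is platform-dependent):
#include <stdio.h>

int main(void) {
    double xd = 665857.0;
    double d4 = xd * xd * xd * xd;      /* rounded, but a sensible ~1.97e23 */

    long long xl = 665857LL;
    long long i4 = xl * xl * xl * xl;   /* true value exceeds LLONG_MAX, so this overflows */

    printf("as double:    %f\n", d4);
    printf("as long long: %lld\n", i4);
    return 0;
}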

In a double there are bits for the mantissa and the exponent. For large doubles the distance between two consecutive doubles (same exponent, 1 added to the mantissa) is much larger than 1. Hence you are in the same situation as infinity + 1 = infinity.
long longs will overflow and wrap around modulo 2^64 (on a typical 64-bit long long), and hence the result, when it should be 1, can indeed be 1.
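To see the spacing effect concretely, a tiny sketch:
#include <stdio.h>

int main(void) {
    /* Around 1e17 the gap between consecutive doubles is 16,
       so adding 1 cannot change the value. */
    double big = 1e17;
    printf("%d\n", big + 1.0 == big);   /* prints 1 (true) */
    return 0;
}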

An overflow of a floating-point type is either undefined or treated as an error, depending on language and environment. An overflow of an integral type simply wraps around in practice (sometimes still yielding correct results, and sometimes not).

Related

Trying to recreate printf's behaviour with doubles and given precisions (rounding) and have a question about handling big numbers

I'm trying to recreate printf and I'm currently trying to find a way to handle the conversion specifiers that deal with floats. More specifically: I'm trying to round doubles at a specific decimal place. Now I have the following code:
double ft_round(double value, int precision)
{
long long int power;
long long int result;
power = ft_power(10, precision);
result = (long long int) (value * power);
return ((double)result / power);
}
Which works for relatively small numbers (I haven't quite figured out whether printf compensates for truncation and rounding errors caused by it but that's another story). However, if I try a large number like
-154584942443242549.213565124235
I get -922337203685.4775391 as output, whereas printf itself gives me
-154584942443242560.0000000 (precision for both outputs is 7).
Both aren't exactly the output I was expecting but I'm wondering if you can help me figure out how I can make my idea for rounding work with larger numbers.
My question is basically twofold:
What exactly is happening in this case, both with my code and printf itself, that causes this output? (I'm pretty new to programming, sorry if it's a dumb question)
Do you guys have any tips on how to make my code capable of handling these bigger numbers?
P.S. I know there are libraries and such to do the rounding but I'm looking for a reinventing-the-wheel type of answer here, just FYI!
You can't round to a particular decimal precision with binary floating point arithmetic. It's just not possible. At small magnitudes, the errors are small enough that you can still get the right answer, but in general it doesn't work.
The only way to round a floating point number as decimal is to do all the arithmetic in decimal. Basically you start with the mantissa, converting it to decimal like an integer, then scale it by powers of 2 (the exponent) using decimal arithmetic. The amount of (decimal) precision you need to keep at each step is roughly (just a bit over) the final decimal precision you want. If you want an exact result, though, it's on the order of the base-2 exponent range (i.e. very large).
Typically rather than using base 10, implementations will use a base that's some large power of 10, since it's equivalent to work with but much faster. 1000000000 is a nice base because it fits in 32 bits and lets you treat your decimal representation as an array of 32-bit ints (comparable to how BCD lets you treat decimal representations as arrays of 4-bit nibbles).
My implementation in musl is dense but demonstrates this approach near-optimally and may be informative.
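As a rough sketch of that idea (illustrative only, and restricted to the integer part of non-negative doubles so that the negative-exponent scaling can be left out):
#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Print the exact decimal value of a non-negative double whose binary
   exponent is >= 0 (i.e. a large value), using base-1000000000 digits.
   Fractional scaling (dividing by 2) is omitted to keep the sketch short. */
static void print_exact(double x)
{
    uint32_t big[40] = {0};               /* base-1e9 digits, least significant first */
    int len;

    int e;
    double frac = frexp(x, &e);           /* x = frac * 2^e, 0.5 <= frac < 1 */
    uint64_t mant = (uint64_t)(frac * 9007199254740992.0);  /* frac * 2^53, exact */
    e -= 53;                              /* now x = mant * 2^e exactly */
    if (e < 0) {
        printf("(value too small for this sketch)\n");
        return;
    }

    big[0] = (uint32_t)(mant % 1000000000u);
    big[1] = (uint32_t)(mant / 1000000000u);
    len = big[1] ? 2 : 1;

    for (; e > 0; e--) {                  /* multiply the big number by 2, e times */
        uint32_t carry = 0;
        for (int i = 0; i < len; i++) {
            uint64_t v = (uint64_t)big[i] * 2u + carry;
            big[i] = (uint32_t)(v % 1000000000u);
            carry  = (uint32_t)(v / 1000000000u);
        }
        if (carry)
            big[len++] = carry;
    }

    printf("%u", (unsigned)big[len - 1]);         /* most significant group */
    for (int i = len - 2; i >= 0; i--)
        printf("%09u", (unsigned)big[i]);         /* remaining groups, zero padded */
    printf("\n");
}

int main(void)
{
    print_exact(ldexp(1.0, 80));          /* 2^80: every digit printed is exact */
    print_exact(1e22);                    /* exactly representable: 1 followed by 22 zeros */
    return 0;
}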
What exactly is happening in this case, both with my code and printf itself, that causes this output?
Overflow. Either ft_power(10, precision) exceeds LLONG_MAX, or value * power exceeds LLONG_MAX, or both.
Do you guys have any tips on how to make my code capable of handling these bigger numbers?
Set aside the various integer types for rounding/truncation. Use FP routines like round(), nearbyint(), etc.
double ft_round(double value, int precision) {
// Use a re-coded `ft_power()` that computes/returns `double`
double pwr = ft_power(10, precision);
return round(value * pwr)/pwr;
}
As mentioned in this answer, floating-point numbers have binary characteristics as well as finite precision. Using only double will extend the range of acceptable behavior. At extreme precision, the value computed with this code will be close to, yet potentially only near, the desired result.
Using temporary wider math will extend the acceptable range.
double ft_round(double value, int precision) {
double pwr = ft_power(10, precision);
return (double) (roundl((long double) value * pwr)/pwr);
}
I haven't quite figured out whether printf compensates for truncation and rounding errors caused by it but that's another story
See Printf width specifier to maintain precision of floating-point value to print FP with enough precision.

Gaussian integral and double division

For fun, I was trying to evaluate the Gaussian integral from 0 to 1 using a series expansion. For this reason, I wrote a factorial function which works well up to 20! (I checked), and then I wrote this:
int main(){
int n;
long double result=0;
for(n=0; n<=5; n++){
if(n%2==0){
result+=(((long double) 1/(long double)(factorial(n)*(2*n+1))));
} else {
result-=(((long double) 1/(long double)(factorial(n)*(2*n+1))));
}
}
printf("The Gaussian integral from 0 to 1 is %Lf\n", result);
}
This gives me a strange negative number which is obviously not even close. I suspect the problem is with the cast, but I don't know what it is. Any thoughts? This is not the first thing I tried: I tried casting everything in the expression and putting the explicit cast at the beginning, but it didn't work.
You are using the MinGW compiler (port of gcc for Windows), which has issues with the long double type. This is due to conflicts between GCC's implementation of long double and Microsoft's C library. See also this question.
According to this question, defining __USE_MINGW_ANSI_STDIO may solve this. If not, using double instead will work.
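If you go the macro route, it has to be visible before stdio.h is included (or be passed with -D on the command line); a minimal sketch:
#define __USE_MINGW_ANSI_STDIO 1   /* ask MinGW for C99-conforming printf */
#include <stdio.h>

int main(void) {
    long double r = 1.0L / 3.0L;
    printf("%Lf\n", r);            /* %Lf should now print a sensible value */
    return 0;
}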
In (long double)(factorial(n)*(2*n+1)), the multiplications are integer multiplications, and the first one could overflow if the result of factorial is already close to the limit of the integer type used.
Write ((long double)factorial(n))*(2*n+1) so that the first multiplication is a floating-point multiplication.
You're almost certainly overflowing your integer type. In C this is technically undefined behaviour.
For 32 bit unsigned integer, 13! will overflow. On 64 bit, 21! will overflow.
Your algorithm will survive a little longer if you use a floating point double type, or an extension like unsigned __int128 (which gives you, I think, up to 34!) if your compiler supports it.
Another problem that you have is that you are progressively adding terms of decreasing size to your total. That's never a good idea when working with floating point types. If you run your for loop in the reverse order then the result will be more accurate.
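Putting those fixes together, a sketch (with a hypothetical unsigned long long factorial(), since the question's version is not shown):
#include <stdio.h>

/* Hypothetical stand-in for the question's factorial(). */
static unsigned long long factorial(int n) {
    unsigned long long f = 1;
    for (int i = 2; i <= n; i++)
        f *= (unsigned long long) i;
    return f;
}

int main(void) {
    long double result = 0.0L;
    /* The loop runs from the smallest terms (largest n) to the largest, and
       the cast is applied to factorial(n) before the multiplication, so the
       product is computed in long double rather than in an integer type. */
    for (int n = 5; n >= 0; n--) {
        long double term = 1.0L / ((long double) factorial(n) * (2 * n + 1));
        if (n % 2 == 0)
            result += term;
        else
            result -= term;
    }
    printf("The Gaussian integral from 0 to 1 is %Lf\n", result);
    return 0;
}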

Why doesn't my code work when replacing 622.08E6 with 622080000?

I recently came across a C code (working by the way) where I found
freq_xtal = ((622.08E6 * vcxo_reg_val->hiv * vcxo_reg_val->n1)/(temp_rfreq));
From my intuition it seems that 622.08E6 should mean 622.08 × 10^6. From this question this assumption is correct.
So I tried replacing 622.08e6 with
uint32_t default_freq = 622080000;
For some reason this doesn't seem to work
Any thoughts or suggestions appreciated
The problem you are having (and I'm speculating here because I don't have the rest of your code) appears to be that replacing the floating point with an integer caused the multiplication and division to be integer based, and not decimal based. As a result, you now compute the wrong value.
Try type casting your uint32_t to a double and see if that clears it up.
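For example, keeping the original variable names (whether freq_xtal itself is a floating-point type is not shown in the question):
freq_xtal = ((double) default_freq * vcxo_reg_val->hiv * vcxo_reg_val->n1) / temp_rfreq;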
The problem is due to overflow!
The original expression (622.08E6 * vcxo_reg_val->hiv * vcxo_reg_val->n1)/temp_rfreq (you have too many unnecessary parentheses, though) is done in double precision because 622.08E6 is a double literal, so it produces a floating-point value.
However, if you replace the literal with 622080000 then the whole expression is done in integer math if all the variables are integers. More importantly, integer math overflows (at least much sooner than floating-point math does).
Notice that UINT32_MAX / 622080000.0 ≈ 6.9. That means multiplying the constant by just 7 will already overflow, and in the code you multiply 622080000 by 2 other values whose product may well be above 6. You should add the ULL suffix to do the math in unsigned long long:
freq_xtal = (622080000ULL * vcxo_reg_val->hiv * vcxo_reg_val->n1)/temp_rfreq;
or change the variable to uint64_t default_freq = 622080000ULL;
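A self-contained sketch of the difference (the register values are made up purely to make the overflow visible; they are not from the original driver code):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Hypothetical values, chosen only to trigger the overflow. */
    uint32_t hiv = 7, n1 = 2, temp_rfreq = 5;

    /* All-uint32_t math: 622080000u * 7 already exceeds UINT32_MAX
       (cf. the ratio of about 6.9 above), so the intermediate wraps. */
    uint32_t wrapped = 622080000u * hiv * n1 / temp_rfreq;

    /* With the ULL suffix the whole expression is evaluated in
       unsigned long long, as in the corrected line above. */
    uint64_t wide = 622080000ULL * hiv * n1 / temp_rfreq;

    printf("32-bit math: %u\n", (unsigned) wrapped);
    printf("64-bit math: %llu\n", (unsigned long long) wide);
    return 0;
}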

Can I calculate error introduced by doubles?

Suppose I have an irrational number like √3. As it is irrational, it has no finite decimal representation. So when you try to express it with an IEEE 754 double, you will introduce an error.
A decimal representation with a lot of digits is:
1.7320508075688772935274463415058723669428052538103806280558069794519330169088
00037081146186757248575675...
Now, when I calculate √3, I get 1.732051:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
int main() {
double myVar = sqrt (3);
printf("as double:\t%f\n", myVar);
}
According to Wolfram|Alpha, I have an error of 1.11100... × 10^-7.
Is there any way I can calculate the error myself?
(I don't mind switching to C++, Python or Java. I could probably also use Mathematica, if there is no simple alternative)
Just to clarify: I don't want a solution that works only for √3. I would like to get a function that gives me the error for any number. If that is not possible, I would at least like to know how Wolfram|Alpha gets more values.
My try
While writing this question, I found this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h> // needed for higher precision
int main() {
long double r = sqrtl(3.0L);
printf("Precision: %d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
}
With this one, I can get the error down to 2.0 * 10^-18 according to Wolfram|Alpha. So I thought this might be close enough to get a good estimation of the error. I wrote this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h>
int main() {
double myVar = sqrt (3);
long double r = sqrtl(3.0L);
long double error = abs(r-myVar) / r;
printf("Double:\t\t%f\n", myVar);
printf("Precision:\t%d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
printf("Error:\t\t%.*Lg\n", LDBL_DIG, error);
}
But it outputs:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 0
How can I fix that to get the error?
What Every Computer Scientist Should Know About Floating-Point Arithmetic by Goldberg is the definitive guide you are looking for.
https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf
printf rounds doubles to 6 places when you use %f without a precision.
e.g.
double x = 1.3;
long double y = 1.3L;
long double err = y - (double) x;
printf("Error %.20Lf\n", err);
My output: -0.00000000000000004445
If the result is 0, your long double and double are the same.
One way to obtain an interval that is guaranteed to contain the real value of the computation is to use interval arithmetic. Then, comparing the double result to the interval tells you how far the double computation is, at worst, from the real computation.
Frama-C's value analysis can do this for you with option -all-rounding-modes.
double Frama_C_sqrt(double x);
double sqrt(double x)
{
return Frama_C_sqrt(x);
}
double y;
int main(){
y = sqrt(3.0);
}
Analyzing the program with:
frama-c -val t.c -float-normal -all-rounding-modes
[value] Values at end of function main:
y ∈ [1.7320508075688772 .. 1.7320508075688774]
This means that the real value of sqrt(3), and thus the value that would be in variable y if the program computed with real numbers, is within the double bounds [1.7320508075688772 .. 1.7320508075688774].
Frama-C's value analysis does not support the long double type, but if I understand correctly, you were only using long double as reference to estimate the error made with double. The drawback of that method is that long double is itself imprecise. With interval arithmetic as implemented in Frama-C's value analysis, the real value of the computation is guaranteed to be within the displayed bounds.
You have a mistake in printing Double: 1.732051 here: printf("Double:\t\t%f\n", myVar); shows only 6 decimal places.
The actual value of double myVar is
1.732050807568877281 // 18 digits
so 1.732050807568877281 - 1.732050807568877281 is zero.
According to the C standard printf("%f", d) will default to 6 digits after the decimal point. This is not the full precision of your double.
It might be that double and long double happen to be the same on your architecture. I have different sizes for them on my architecture and get a non-zero error in your example code.
You want fabsl instead of abs when calculating the error, at least when using C. (In C, abs is integer.) With this substitution, I get:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 5.79643049346087304e-17
(Calculated on Mac OS X 10.8.3 with Apple clang 4.0.)
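For reference, here is the question's program with just that change applied:
#include <stdio.h> // printf
#include <math.h>  // sqrt, sqrtl, fabsl
#include <float.h> // LDBL_DIG

int main() {
    double myVar = sqrt(3);
    long double r = sqrtl(3.0L);
    /* fabsl keeps the difference in long double; abs() would have
       converted it to an int, truncating it to 0. */
    long double error = fabsl(r - myVar) / r;
    printf("Double:\t\t%f\n", myVar);
    printf("Precision:\t%d digits; %.*Lg\n", LDBL_DIG, LDBL_DIG, r);
    printf("Error:\t\t%.*Lg\n", LDBL_DIG, error);
}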
Using long double to estimate the errors in double is a reasonable approach for a few simple calculations, except:
If you are calculating the more accurate long double results, why bother with double?
Error behavior in sequences of calculations is hard to describe and can grow to the point where long double is not providing an accurate estimate of the exact result.
There exist perverse situations where long double gets less accurate results than double. (Mostly encountered when somebody constructs an example to teach students a lesson, but they exist nonetheless.)
In general, there is no simple and efficient way to calculate the error in a floating-point result in a sequence of calculations. If there were, it would be effectively a means of calculating a more accurate result, and we would use that instead of the floating-point calculations alone.
In special cases, such as when developing math library routines, the errors resulting from a particular sequence of code are studied carefully (and the code is redesigned as necessary to have acceptable error behavior). More often, error is estimated either by performing various “experiments” to see how much results fluctuate with varying inputs or by studying general mathematical behavior of systems.
You also asked “I would like to get a function that gives me the error for any number.” Well, that is easy, given any number x and the calculated result x', the error is exactly x' – x. The actual problem is you probably do not have a description of x that can be used to evaluate that expression easily. In your example, x is sqrt(3). Obviously, then, the error is sqrt(3) – x, and x is exactly 1.732050807568877193176604123436845839023590087890625. Now all you need to do is evaluate sqrt(3). In other words, numerically evaluating the error is about as hard as numerically evaluating the original number.
Is there some class of numbers you want to perform this analysis for?
Also, do you actually want to calculate the error or just a good bound on the error? The latter is somewhat easier, although it remains hard for sequences of calculations. For all elementary operations, IEEE 754 requires the produced result to be the result that is nearest the mathematically exact result (in the appropriate direction for the rounding mode being used). In round-to-nearest mode, this implies that each result is at most 1/2 ULP (unit of least precision) away from the exact result. For operations such as those found in the standard math library (sine, logarithm, et cetera), most libraries will produce results within a few ULP of the exact result.
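As a small illustration of that 1/2 ULP bound, a sketch that measures the ULP at the computed value with nextafter:
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = sqrt(3);
    /* The ULP at x is the gap to the next representable double above it.
       A correctly rounded result is within half of this of the exact value. */
    double ulp = nextafter(x, INFINITY) - x;
    printf("x        = %.17g\n", x);
    printf("ulp(x)   = %.9g\n", ulp);
    printf("half ULP = %.9g\n", ulp / 2);
    return 0;
}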

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work?
double d = floor(3.0 + 0.5);
int x = (int) d;
assert(x == 3);
My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like 2.99999, and x ends up being 2.
For the answer to this question to be yes, all integers within the range of an int have to be exactly representable as doubles, and floor must always return that exactly represented value.
All integers have an exact floating-point representation if your floating-point type supports the required mantissa bits. Since double uses 53 bits for its mantissa, it can store all 32-bit ints exactly. After all, you could just set the value as the mantissa with a zero exponent.
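That claim is easy to spot-check; a minimal sketch:
#include <assert.h>
#include <limits.h>

int main(void) {
    /* Round-trip the extremes of int through double; both come back intact
       because a 53-bit significand holds any 32-bit integer exactly.
       (An exhaustive loop over all 2^32 values would also pass.) */
    assert((int) (double) INT_MAX == INT_MAX);
    assert((int) (double) INT_MIN == INT_MIN);
    return 0;
}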
If the result of floor() isn't exactly representable, what do you expect the value of d to be? Surely if you've got the representation of a floating point number in a variable, then by definition it's exactly representable isn't it? You've got the representation in d...
(In addition, Mehrdad's answer is correct for 32 bit ints. In a compiler with a 64 bit double and a 64 bit int, you've got more problems of course...)
EDIT: Perhaps you meant "the theoretical result of floor(), i.e. the largest integer value less than or equal to the argument, may not be representable as an int". That's certainly true. Simple way of showing this for a system where int is 32 bits:
int max = 0x7fffffff;
double number = max;
number += 10.0;
double f = floor(number);
int oops = (int) f;
I can't remember offhand what C does when a conversion from floating point to integer overflows (the behaviour is in fact undefined)... but it's going to happen here.
EDIT: There are other interesting situations to consider too. Here's some C# code and results - I'd imagine at least similar things would happen in C. In C#, double is defined to be 64 bits and so is long.
using System;
class Test
{
static void Main()
{
FloorSameInteger(long.MaxValue/2);
FloorSameInteger(long.MaxValue-2);
}
static void FloorSameInteger(long original)
{
double convertedToDouble = original;
double flooredToDouble = Math.Floor(convertedToDouble);
long flooredToLong = (long) flooredToDouble;
Console.WriteLine("Original value: {0}", original);
Console.WriteLine("Converted to double: {0}", convertedToDouble);
Console.WriteLine("Floored (as double): {0}", flooredToDouble);
Console.WriteLine("Converted back to long: {0}", flooredToLong);
Console.WriteLine();
}
}
Results:
Original value: 4611686018427387903
Converted to double: 4.61168601842739E+18
Floored (as double): 4.61168601842739E+18
Converted back to long: 4611686018427387904

Original value: 9223372036854775805
Converted to double: 9.22337203685478E+18
Floored (as double): 9.22337203685478E+18
Converted back to long: -9223372036854775808
In other words:
(long) floor((double) original)
isn't always the same as original. This shouldn't come as any surprise - there are more long values than doubles (given the NaN values) and plenty of doubles aren't integers, so we can't expect every long to be exactly representable. However, all 32 bit integers are representable as doubles.
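For comparison, a rough C translation of that experiment (it assumes a 64-bit long long and IEEE 754 double; note that converting a double back to an integer type that cannot represent it is undefined behaviour in C, so the last line of the second case may vary):
#include <stdio.h>
#include <math.h>
#include <limits.h>

static void floor_same_integer(long long original) {
    double converted = (double) original;       /* may round: needs more than 53 bits */
    double floored   = floor(converted);
    printf("Original value:         %lld\n", original);
    printf("Converted to double:    %.17g\n", converted);
    printf("Floored (as double):    %.17g\n", floored);
    printf("Converted back to long: %lld\n\n", (long long) floored);  /* UB if out of range */
}

int main(void) {
    floor_same_integer(LLONG_MAX / 2);
    floor_same_integer(LLONG_MAX - 2);
    return 0;
}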
I think you're a bit confused about what you want to ask. floor(3 + 0.5) is not a very good example, because 3, 0.5, and their sum are all exactly representable in any real-world floating point format. floor(0.1 + 0.9) would be a better example, and the real question here is not whether the result of floor is exactly representable, but whether inexactness of the numbers prior to calling floor will result in a return value different from what you would expect, had all numbers been exact. In this case, I believe the answer is yes, but it depends a lot on your particular numbers.
I invite others to criticize this approach if it's bad, but one possible workaround might be to multiply your number by (1.0+0x1p-52) or something similar prior to calling floor (perhaps using nextafter would be better). This could compensate for cases where an error in the last binary place of the number causes it to fall just below rather than exactly on an integer value, but it will not account for errors which have accumulated over a number of operations. If you need that level of numeric stability/exactness, you need to either do some deep analysis or use an arbitrary-precision or exact-math library which can handle your numbers correctly.
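A rough illustration of that nudge, using the classic example 4.35 * 100, which on typical IEEE 754 systems evaluates to just below 435 (the usual caveat applies: this only papers over an error of at most one ULP, and whether it is the right thing to do depends on where the error came from):
#include <stdio.h>
#include <math.h>

int main(void) {
    /* 4.35 is not exactly representable; the product can land just under 435,
       in which case floor() returns 434. */
    double v = 4.35 * 100.0;
    printf("v      = %.17g\n", v);
    printf("plain  = %g\n", floor(v));
    printf("nudged = %g\n", floor(nextafter(v, INFINITY)));
    return 0;
}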
