Is float better than double sometimes? - c

I was solving this problem on spoj http://www.spoj.com/problems/ATOMS/. I had to give the integral part of log(m / n) / log(k) as output. I had taken m, n, k as long long. When I was calculating it using long doubles, I was getting a wrong answer, but when I used float, it got accepted.
printf("%lld\n", (long long)(log(m / (long double)n) / log(k)));
This was giving a wrong answer but this:
printf("%lld\n", (long long)((float)log(m / (float)n) / (float)log(k)));
got accepted. So are there situations when float is better than double with respect to precision?

A float is never more accurate than a double since the former must be a subset of the latter, by the C standard:
6.2.5/6: "The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double."
Note that the standard does not insist on a particular floating point representation although IEEE754 is particularly common.
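As a rough illustration of the subset relationship, the macros in float.h report how many significand digits each type carries on a given implementation. The function name below is made up for this sketch, and the printed numbers depend on your compiler and target.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Prints the significand width (in FLT_RADIX digits) and the number of
 * reliable decimal digits for each standard floating-point type. */
void print_fp_precision(void)
{
    /* Follows from 6.2.5: every float is a double, every double a long double. */
    assert(FLT_MANT_DIG <= DBL_MANT_DIG && DBL_MANT_DIG <= LDBL_MANT_DIG);

    printf("float:       %d significand digits, %d decimal digits\n",
           FLT_MANT_DIG, FLT_DIG);
    printf("double:      %d significand digits, %d decimal digits\n",
           DBL_MANT_DIG, DBL_DIG);
    printf("long double: %d significand digits, %d decimal digits\n",
           LDBL_MANT_DIG, LDBL_DIG);
}
```

If the last two lines print the same numbers, long double is no wider than double on that platform.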

It might be better in some cases in terms of calculation time and space. One example is on the table in front of me: an ARM Cortex-M4F based microcontroller with a hardware Floating Point Unit (FPU) that supports single-precision arithmetic but not double precision. Using float there gives an incredible boost to floating point calculations.

Try this simple code :
#include<stdio.h>
int main(void)
{
float i=3.3;
if(i==3.3)
printf("Equal\n");
else
printf("Not Equal\n");
return 0;
}
Now try the same with double as the data type of i.
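The difference comes from the literal 3.3 having type double. A minimal sketch of the three possible comparisons follows; the function names are invented here, and the results noted in the comments assume IEEE 754 types with FLT_EVAL_METHOD equal to 0.

```c
#include <stdbool.h>

/* float variable vs. double literal: the float is widened to double for the
 * comparison, but it already lost bits when 3.3 was rounded to float,
 * so the two values differ. */
bool float_vs_double_literal(void)
{
    float f = 3.3;
    return f == 3.3;
}

/* float variable vs. float literal: both sides were rounded the same way. */
bool float_vs_float_literal(void)
{
    float f = 3.3f;
    return f == 3.3f;
}

/* double variable vs. double literal: no narrowing happens at all. */
bool double_vs_double_literal(void)
{
    double d = 3.3;
    return d == 3.3;
}
```

So the snippet shows that a float holds fewer bits of 3.3 than a double does, not that float is the more accurate type.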

double will always give you more precision than a float.
With double, you encode the number using 64 bits, while you're using only 32 bits with float.
Edit: As Jens mentioned it may not be the case. double will give more precision only if the compiler is using IEEE-754. That's the case of GCC, Clang and MSVC. I haven't yet encountered a compiler which didn't use 32 bits for floats and 64 bits for doubles though...

Related

Trying to recreate printf's behaviour with doubles and given precisions (rounding) and have a question about handling big numbers

I'm trying to recreate printf and I'm currently trying to find a way to handle the conversion specifiers that deal with floats. More specifically: I'm trying to round doubles at a specific decimal place. Now I have the following code:
double ft_round(double value, int precision)
{
long long int power;
long long int result;
power = ft_power(10, precision);
result = (long long int) (value * power);
return ((double)result / power);
}
Which works for relatively small numbers (I haven't quite figured out whether printf compensates for truncation and rounding errors caused by it but that's another story). However, if I try a large number like
-154584942443242549.213565124235
I get -922337203685.4775391 as output, whereas printf itself gives me
-154584942443242560.0000000 (precision for both outputs is 7).
Both aren't exactly the output I was expecting but I'm wondering if you can help me figure out how I can make my idea for rounding work with larger numbers.
My question is basically twofold:
What exactly is happening in this case, both with my code and printf itself, that causes this output? (I'm pretty new to programming, sorry if it's a dumb question)
Do you guys have any tips on how to make my code capable of handling these bigger numbers?
P.S. I know there are libraries and such to do the rounding but I'm looking for a reinventing-the-wheel type of answer here, just FYI!
You can't round to a particular decimal precision with binary floating point arithmetic. It's just not possible. At small magnitudes, the errors are small enough that you can still get the right answer, but in general it doesn't work.
The only way to round a floating point number as decimal is to do all the arithmetic in decimal. Basically you start with the mantissa, converting it to decimal like an integer, then scale it by powers of 2 (the exponent) using decimal arithmetic. The amount of (decimal) precision you need to keep at each step is roughly (just a bit over) the final decimal precision you want. If you want an exact result, though, it's on the order of the base-2 exponent range (i.e. very large).
Typically rather than using base 10, implementations will use a base that's some large power of 10, since it's equivalent to work with but much faster. 1000000000 is a nice base because it fits in 32 bits and lets you treat your decimal representation as an array of 32-bit ints (comparable to how BCD lets you treat decimal representations as arrays of 4-bit nibbles).
My implementation in musl is dense but demonstrates this approach near-optimally and may be informative.
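To make the base-1000000000 idea concrete, here is a much simplified sketch that is not taken from musl. It handles only the integer part of a non-negative finite double and assumes a 53-bit significand (IEEE 754 binary64). The names exact_integer_digits, LIMB_BASE and LIMB_COUNT are invented for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <math.h>
#include <stdio.h>

#define LIMB_BASE  1000000000u  /* 10^9 fits in 32 bits */
#define LIMB_COUNT 40           /* 360 decimal digits; DBL_MAX needs 309 */

/* Writes the exact decimal digits of the integer part of x into out
 * (at least LIMB_COUNT * 9 + 1 bytes) and returns the number of digits.
 * Assumes x is finite, x >= 0 and double has a 53-bit significand. */
int exact_integer_digits(double x, char *out)
{
    assert(isfinite(x) && x >= 0.0);

    uint32_t limb[LIMB_COUNT] = {0};            /* little-endian base-10^9 digits */
    int exp2;
    double frac = frexp(x, &exp2);              /* x = frac * 2^exp2, 0.5 <= frac < 1 */
    uint64_t mant = (uint64_t)ldexp(frac, 53);  /* significand as an integer */
    exp2 -= 53;                                 /* now x = mant * 2^exp2 */

    if (exp2 < 0) {                             /* discard the fractional bits */
        mant = (exp2 <= -64) ? 0 : mant >> -exp2;
        exp2 = 0;
    }
    for (int i = 0; mant != 0; i++) {           /* convert mant like an integer */
        limb[i] = (uint32_t)(mant % LIMB_BASE);
        mant /= LIMB_BASE;
    }
    for (int k = 0; k < exp2; k++) {            /* scale by 2^exp2 in decimal */
        uint32_t carry = 0;
        for (int i = 0; i < LIMB_COUNT; i++) {
            uint64_t v = (uint64_t)limb[i] * 2 + carry;
            limb[i] = (uint32_t)(v % LIMB_BASE);
            carry = (uint32_t)(v / LIMB_BASE);
        }
    }

    int top = LIMB_COUNT - 1;
    while (top > 0 && limb[top] == 0)
        top--;
    char *p = out;
    p += sprintf(p, "%" PRIu32, limb[top]);     /* leading limb, no padding */
    while (top-- > 0)
        p += sprintf(p, "%09" PRIu32, limb[top]);
    return (int)(p - out);
}
```

Feeding it the magnitude of the value from the question, 154584942443242549.213565124235, produces 154584942443242560, the same digits printf showed. A real implementation also has to handle the fractional digits, rounding at the requested precision, and the sign.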
What exactly is happening in this case, both with my code and printf itself, that causes this output?
Overflow. Either ft_power(10, precision) exceeds LLONG_MAX, or value * power exceeds the long long range, or both.
Do you guys have any tips on how to make my code capable of handling these bigger numbers?
Set aside the various int types for rounding/truncation. Use FP routines like round(), nearbyint(), etc.
double ft_round(double value, int precision) {
// Use a re-coded `ft_power()` that computes/returns `double`
double pwr = ft_power(10, precision);
return round(value * pwr)/pwr;
}
As well mentioned in this answer, floating point numbers have binary characteristics as well as finite precision. Using only double will extend the range of acceptable behavior. With extreme precision, the value computed with this code will be close to, yet potentially not exactly, the desired result.
Using temporary wider math will extend the acceptable range.
double ft_round(double value, int precision) {
double pwr = ft_power(10, precision);
return (double) (roundl((long double) value * pwr)/pwr);
}
I haven't quite figured out whether printf compensates for truncation and rounding errors caused by it but that's another story
See Printf width specifier to maintain precision of floating-point value to print FP with enough precision.

C - Long double min and max value [duplicate]

I'm working with C, I have to do an exercise in which I have to print the value of long double min and long double max.
I used float.h as header, but these two macros (LDBL_MIN/MAX) give me the same value as if it was just a double.
I'm using Visual Studio 2015 and if I hover the mouse on LDBL MIN it says #define LDBL_MIN DBL_MIN. Is that why it prints dbl_min instead of ldbl_min?
How can I fix this problem?
printf("Type: Long Double Value: %lf Min: %e Max: %e Memory:%lu\n",
val10, LDBL_MIN, LDBL_MAX, longd_size);
It is a problem because my assignment requires two different values for LDBL and DBL.
C does not specify that long double must have a greater precision/range than double.
Even if the implementation treats them as different types, they may have the same implementation, range, precision, min value, max value, etc.
Concerning Visual Studio, MS Long Double helps.
To fix the problem, use another compiler that supports long double with a greater precision/range than double. Perhaps GCC?
From this reference on the floating point types:
long double - extended precision floating point type. Matches IEEE-754 extended floating-point type if supported, otherwise matches some non-standard extended floating-point type as long as its precision is better than double and range is at least as good as double, otherwise matches the type double. Some x86 and x86_64 implementations use the 80-bit x87 floating point type.
Added emphasis is mine.
What the above quote says is that while a compliant C compiler must have the long double type, it doesn't really have to support it differently than double, which is probably the case with the Visual Studio C compiler.
Those macros are either broken, or long double is just an alias for double on your system. To test, set a long double to DBL_MAX, multiply by two, then subtract DBL_MAX from it. If the result is finite, then you have extra exponent space in the long double. If not, and long double is bigger than double, the extra bytes could just be padding, or you could have the same exponent space and more precision. So LDBL_MAX's genuine value will be just a smidgen over DBL_MAX.
The easiest way to generate the max is simply to look up the binary representation. However if you want to do it in portable C, you can probe it by repeated multiplications to get the magnitude, then fill out the mantissa by repeatedly adding descending powers of two until you run out of precision.
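Both the range test and the probing approach can be sketched in a few lines. The function names are made up for this example, and the probe assumes a binary floating-point long double in the default round-to-nearest mode; it is not expected to behave sensibly for exotic formats such as double-double.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* The test described above: does long double have more exponent range
 * than double? */
bool long_double_has_extra_range(void)
{
    long double x = DBL_MAX;
    x *= 2.0L;
    return isfinite(x);
}

/* Probes the largest finite long double without using LDBL_MAX. */
long double probe_long_double_max(void)
{
    long double x = 1.0L;

    /* Magnitude: keep doubling while the next doubling stays finite. */
    while (isfinite(x * 2.0L))
        x *= 2.0L;

    /* Mantissa: add descending powers of two until a bit either
     * overflows or is too small to change the value. */
    for (long double bit = x / 2.0L; bit > 0.0L; bit /= 2.0L) {
        long double sum = x + bit;
        if (!isfinite(sum) || sum == x)
            break;
        x = sum;
    }
    return x;
}
```

On a toolchain where long double is the same as double, the first function returns false and the second returns a value equal to DBL_MAX.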

Is there any accuracy gain when casting to double and back when doing float division?

What is the difference between two following?
float f1 = some_number;
float f2 = some_near_zero_number;
float result;
result = f1 / f2;
and:
float f1 = some_number;
float f2 = some_near_zero_number;
float result;
result = (double)f1 / (double)f2;
I am especially interested in very small f2 values which may produce +infinity when operating on floats. Is there any accuracy to be gained?
Some practical guidelines for using this kind of cast would be nice as well.
I am going to assume IEEE 754 binary floating point arithmetic, with float 32 bit and double 64 bit.
In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps.
Conversion from float to double is exact. For infinite, NaN, or zero divisor inputs it makes no difference. Given a finite result, the IEEE 754 standard requires the result to be the result of the real number division f1/f2, rounded to the type being used in the division.
If it is done as a float division that is the closest float to the exact result. If it is done as double division, it will be the closest double with an additional rounding step for the assignment to result.
For most inputs, the two will give the same answer. Any overflow or underflow that did not happen on the division because it was done in double will happen instead on the conversion.
For simple conversion, if the answer is very close to half way between two float values the two rounding steps may pick the wrong float. I had assumed this could also apply to division results. However, Pascal Cuoq, in a comment on this answer, has called attention to a very interesting paper, Innocuous Double Rounding of Basic Arithmetic Operations by Pierre Roux, claiming proof that double rounding is harmless for several operations, including division, under conditions that are implied by the assumptions I made at the start of this answer.
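A small experiment along these lines compares the two forms of the division directly. It assumes IEEE 754 binary32 and binary64 with Annex F conversion behaviour (an out-of-range double converts to an infinite float); the function names are invented for the sketch, and a brute-force scan is of course no substitute for the proof.

```c
/* Division carried out in float. */
float divide_as_float(float f1, float f2)
{
    return f1 / f2;
}

/* Division carried out in double, then rounded back to float. */
float divide_via_double(float f1, float f2)
{
    return (float)((double)f1 / (double)f2);
}

/* Counts the pairs (i, j) with 1 <= i, j <= n for which the two
 * versions produce different floats. */
int count_mismatches(int n)
{
    int mismatches = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (divide_as_float((float)i, (float)j) !=
                divide_via_double((float)i, (float)j))
                mismatches++;
    return mismatches;
}
```

Under those assumptions the scan should find no mismatches, and a quotient that overflows, such as 1e30f / 1e-30f, should come out as infinity either way: the double version merely moves the overflow from the division to the conversion.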
If the result of an individual floating-point addition, subtraction, multiply, or divide, is immediately stored to a float, there will be no accuracy improvement using double for intermediate values. In cases where operations are chained together, however, accuracy will often be improved by using a higher-precision intermediate type, provided that one is consistent in using them. In Turbo Pascal circa 1986 code like:
Function TriangleArea(A: Single; B: Single; C: Single): Single;
Var S: Extended; (* S stands for Semi-perimeter *)
Begin
S := (A+B+C) * 0.5;
TriangleArea := Sqrt((S-A)*(S-B)*(S-C)*S)
End;
would extend all operands of floating-point operations to type Extended (80-bit float), and then convert them back to single- or double-precision when storing to variables of those types. Very nice semantics for numerical processing. Turbo C of that era behaved similarly, but rather unhelpfully failed to provide any numeric type capable of holding intermediate results; the failure of languages to provide a variable type which could hold intermediate results led to people's unfairly criticizing the concept of a higher-precision intermediate result type, when the real problem was that languages failed to support it properly.
Anyway, if one were to write the above method into a modern language like C#:
public static float triangleArea(float a, float b, float c)
{
double s = (a + b + c) * 0.5;
return (float)(Math.Sqrt((s - a) * (s - b) * (s - c) * s));
}
the code would work well if the compiler happens to promote the operands of the addition to double before performing the computation, but that's something it may or may not do. If the compiler performs the calculation as float, precision may be horrid. When using the above formula to compute the area of an isosceles triangle with long sides of 16777215 and a short side of 4, for example, eager promotion will yield a correct result of 3.355443E+7 while performing the math as float will, depending upon the order of the operands, yield 5.033165E+7 [more than 50% too big] or 16777214.0 [more than 50% too small].
Note that code like the above will work perfectly in some environments but yield completely bogus results in others, and compilers will generally not give any warning about the situation.
Although individual operations on float which are going to be immediately stored to float can be done just as accurately with type float as they could be with type double, eagerly promoting operands will often help considerably when operations are combined. In some cases, rearranging operations may avoid problems caused by loss of promotion. For example, the above formula uses five additions, four multiplications, and a square root; rewriting the formula as:
Math.Sqrt((a+b+c)*(b-a+c)*(a-b+c)*(a-c+b))*0.25
increases the number of additions to eight, but will work correctly even if they are performed at single precision.
"Accuracy gain when casting to double and back when doing float division?"
The result depends on other factors aside from only the 2 posted methods.
C allows evaluation of float operations to happen at different levels depending on FLT_EVAL_METHOD. (See below table) If the current setting is 1 or 2, the two methods posted by OP will provide the same answer.
Depending on other code and compiler optimization levels, the quotient result may be used at wider precision in subsequent calculations in either of OP's cases.
Because of this, a float division that would overflow or become 0.0 (a result with total loss of precision) due to extreme float values may in fact not overflow or underflow when it is optimized into subsequent calculations, as the quotient is carried forward as double.
To compel the quotient to become a float for future calculations in the midst of potential optimizations, code often uses volatile:
volatile float result = f1 / f2;
C does not specify the precision of math operations, yet common application of standards like IEEE 754 provides that a single operation like a binary32 divide will result in the closest representable answer. Should the divide occur in a wider format like double or long double, the conversion of the wider quotient back to float adds another rounding step that on rare occasions will result in a different answer than the direct float/float.
FLT_EVAL_METHOD
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the type;
1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
2 evaluate all operations and constants to the range and precision of the long double type.
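To find out which row applies to a particular build, the macro can simply be printed. FLT_EVAL_METHOD comes from float.h (C99 and later); the helper names and the wording of the descriptions are made up for this sketch.

```c
#include <float.h>
#include <stdio.h>

/* Maps a FLT_EVAL_METHOD value to a short description of the table row. */
const char *describe_eval_method(int method)
{
    switch (method) {
    case -1: return "indeterminable";
    case 0:  return "each operation evaluated in its own type";
    case 1:  return "float and double evaluated as double";
    case 2:  return "everything evaluated as long double";
    default: return "implementation-defined";
    }
}

/* Prints the evaluation method of the current compiler and target. */
void print_eval_method(void)
{
    printf("FLT_EVAL_METHOD = %d: %s\n",
           (int)FLT_EVAL_METHOD, describe_eval_method((int)FLT_EVAL_METHOD));
}
```

The value can change with compiler flags that select the floating-point unit, so it is worth checking with the exact options used for the real build.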
Practical guidelines:
Use float rather than double to conserve space when needed. (float is usually narrower than double, rarely the same size.) If precision is important, use double (or long double).
Using float rather than double to improve speed may or may not work, as a platform's native operations may all be double. It may be faster, the same, or slower: profile to find out. Much of C was originally designed with double as the only level at which FP was carried out, aside from conversions between double and float. Later C added functions like sinf() to facilitate faster, direct float operations. So the more modern the compiler/platform, the more likely float will be faster. Again: profile to find out.

Gaussian integral and double division

For fun, I was trying to evaluate the Gaussian integral from 0 to 1 using a series expansion. For this reason, I wrote a factorial function which works well up to 20! (I checked) and then I wrote this:
int main(){
int n;
long double result=0;
for(n=0; n<=5; n++){
if(n%2==0){
result+=(((long double) 1/(long double)(factorial(n)*(2*n+1))));
} else {
result-=(((long double) 1/(long double)(factorial(n)*(2*n+1))));
}
}
printf("The Gaussian integral from 0 to 1 is %Lf\n", result);
}
This gives me a strange negative number which is obviously not even close. I suspect the problem is with the cast, but I don't know what it is. Any thoughts? This is not the first thing I tried. I tried converting anything in the expression and putting the explicit cast at the beginning, but it didn't work.
You are using the MinGW compiler (port of gcc for Windows), which has issues with the long double type. This is due to conflicts between GCC's implementation of long double and Microsoft's C library. See also this question.
According to this question, defining __USE_MINGW_ANSI_STDIO may solve this. If not, using double instead will work.
In (long double)(factorial(n)*(2*n+1)), the multiplications are integer multiplications, and the first one could overflow if the result of factorial is already close to the limit of the integer type used.
Write (long double)factorial(n)*(2*n+1) so that the first multiplication is a floating-point multiplication.
You're almost certainly overflowing your integer type. For signed integer types, overflow is undefined behaviour in C; unsigned types wrap around instead, which still gives a wrong factorial.
For a 32-bit unsigned integer, 13! will overflow. With 64 bits, 21! will overflow.
Your algorithm will survive a little longer if you use a floating point double type or an extension like unsigned __int128 (gives you, I think, up to 34!) if your compiler supports it.
Another problem that you have is that you are progressively adding terms of decreasing size to your total. That's never a good idea when working with floating point types. If you run your for loop in the reverse order then the result will be more accurate.
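Putting these suggestions together, one possible shape for the loop keeps the factorial in floating point and adds the smallest terms first. This is a sketch rather than a drop-in fix: gaussian_integral_0_1 and MAX_TERMS are names invented here, and on MinGW the printing issue mentioned above still applies to long double.

```c
#include <assert.h>

#define MAX_TERMS 30

/* Approximates the integral of exp(-x*x) from 0 to 1 with the series
 * sum over n of (-1)^n / (n! * (2n + 1)), using the first `terms` terms. */
long double gaussian_integral_0_1(int terms)
{
    assert(terms > 0 && terms <= MAX_TERMS);

    long double term[MAX_TERMS];
    long double fact = 1.0L;        /* n! as floating point: no integer overflow */

    for (int n = 0; n < terms; n++) {
        if (n > 0)
            fact *= n;
        long double sign = (n % 2 == 0) ? 1.0L : -1.0L;
        term[n] = sign / (fact * (2 * n + 1));
    }

    /* Sum in reverse so the smallest terms are accumulated first. */
    long double sum = 0.0L;
    for (int n = terms - 1; n >= 0; n--)
        sum += term[n];
    return sum;
}
```

With 20 terms the result agrees with the known value of about 0.746824 to well beyond the six digits that %Lf prints.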
