Why does Splint (the C code checker) give an error when comparing a float to an int? - c

Both are mathematical values, however the float does have more precision. Is that the only reason for the error - the difference in precision? Or is there another potential (and more serious) problem?

It's because the set of integer values does not equal the set of float values for the 'int' and 'float' types. For example, the float value 0.5 has no equal in the integer set and the integer value 4519245367 might not exist in the set of values a float can store. So, the checker flags this as an issue to be checked by the programmer.

Because it probably isn't a very good idea. Not all floats can be truncated to ints; not all ints can be converted to floats.

When doing the comparison, the integer value will get "promoted" to a floating point value. At that point you are doing a exact equality comparison between two floating point numbers, which is almost always a bad thing.
You should generally have some sort of "epsilon ball", or range of acceptable values, and you do the comparison if the two vaues are close enough to each other to be considered equal. You need a function roughly like this:
int double_equals(double a, double b, double epsilon)
{
return ( a > ( b - epsilon ) && a < ( b + epsilon ) );
}
If your application doesn't have an obvious choice of epsilon, then use DBL_EPSILON.

Because floats can't store an exact int value, so if you have two variables, int i and float f, even if you assign "i = f;", the comparison "if (i == f)" probably won't return true.

Assuming signed integers and IEEE floating point format, the magnitudes of integers that can be represented are:
short -> 15 bits
float -> 23 bits
long -> 31 bits
double -> 52 bits
Therefore a float can represent any short and a double can represent any long.

If you need to get around this (you have a legitimate reason and are happy none of the issues mentioned in the other answers are an issue for you) then just cast from one type to another.

Related

type of numbers that are not stored in a variable [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 4 years ago.
I'm a Little confused about some numbers in C-Code. I have the following piece of Code
int k;
float a = 0.04f;
for (k=0; k*a < 0.12; k++) {
* do something *
}
in this case, what is the type of "0.12" ? double ? float ? it has never been declared anywhere. Anyways, in my program the Loop above is excuted 4 times, even though 3*0.04=0.12 < 0.12 is not true. Once I Exchange 0.12 with 0.12F (because I am globally restricted to float precision in all of the program), the Loop is now executed 3 times. I do not understand why, and what is happening here. Are there any proper Guidelines on how to write such statements to not get unexpected issues?
Another related issue is the following: in the Definition of variables, say
float b = 1/180 * 3.14159265359;
what exactly "is" "1" in this case ? and "180" ? Integers ? Are they converted to float numbers ? Is it okay to write it like that ? Or should it be "1.0f/180.0f*3.14159265359f;"
Last part of the question,
if i have a fuction
void testfunction(float a)
which does some things.
if I call the fuction with testfunction(40.0/6.0), how is that Division handled ? It seems that its calculated with double precision and then converted to a float. Why ?
That was a Long question, I hope someone can help me understand it.
...numbers that are not stored in a variable
They are called "constants".
Any unsuffixed floating point constant has type double.
Quoting C11, chapter §6.4.4.2
An unsuffixed floating constant has type double. If suffixed by the letter f or F, it has
type float. If suffixed by the letter l or L, it has type long double.
For Integer constants, the type will depend on the value.
Quoting C11, chapter §6.4.4.1,
The type of an integer constant is the first of the corresponding list in which its value can
be represented. [..]
and for unsuffixed decimal number, the list is
int
long int
long long int
Regarding the mathematical operation accuracy of floating point numbers, see this post
"0.12" is a constant, and yes, without a trailing f or F, it will be interpreted as a double.
That number can not be expressed exactly as a binary fraction. When you compare k * a, the result is a float because both operands are floats. The result is slightly less than 0.12, but when you compare against a double, it gets padded out with zeros to the required size, which increases the discrepancy. When you use a float constant, the result is not padded (cast), and by good luck, comes out exactly equal to the binary representation of "0.12f". This would likely not be the case for a more complicated operation.
if we use a fractional number for eg, like 3.4 it's default type if double. You should explicitly specify it like 3.4f if you want it to consider as single precision (float).
if you call the following function
int fun()
{
float a= 1.2f;
double b = 1.2;
return (a==b);
}
this will always return zero (false).because before making comparison the float type is converted to double (lower type to higher type) , At this time sometimes it can't reproduce exact 1.2, a slight variations in value can be happen after 6th position from decimal point. You can check the difference by the following print statement
printf("double: %0.9f \n float: %0.9f\n",1.2,1.2f);
It results ike
double: 1.200000000
float: 1.200000481

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation why I get two different
numbers, resp. 14 and 15, as an output from the following code?
#include <stdio.h>
int main()
{
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
printf("%d %d",b,c); // 14 15, why?
return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant but I was doing the test in CodeBlocks. However, if I type the same lines of code in some on-line compiler ( this one for example) I get an answer of 15 for the two printed variables.
... why I get two different numbers ...
Aside from the usual float-point issues, the computation paths to b and c are arrived in different ways. c is calculated by first saving the value as double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the
type;
1 evaluate operations and constants of type float and double to the
range and precision of the double type, evaluate long double
operations and constants to the range and precision of the long double
type;
2 evaluate all operations and constants to the range and precision of the
long double type.
C11dr §5.2.4.2.2 9
OP reported 2
By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One as 15 (or just above), and the other just under 15. int truncation amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
Conversion to int from a floating-point number is severe with numbers near a whole number. Often better to round() or lround(). The best solution is situation dependent.
This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:
10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
1.1000000000000000000000000000000000000000000000000000
/ .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.
The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.
An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.
If FLT_EVAL_METHOD==2:
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression then truncating it to a int, whereas for c it's evaluating from long double, truncating it to double and then to int.
So both values are not obtained with the same process, and this may lead to different results because floating types does not provides usual exact arithmetic.

Why is the return type of the "ceil()" function "double" instead of some integer type?

I've just implemented a line of code, where two numbers need to be divided and the result needs to be rounded up to the next integer number. I started very naïvely:
i_quotient = ceil(a/b);
As the numbers a and b are both integer numbers, this did not work: the division gets executed as an integer division, which is rounding down by default, so I need to force the division to be a floating point operation:
i_quotient = ceil((double) a / b);
Now this seems to work, but it leaves a warning saying that I am trying to assign a double to an integer, and indeed, following the header file "math.h" the return type of the ceil() function is "double", and now I'm lost: what's the sense of a rounding function to return a double? Can anybody enlighten me about this?
A double has a range that can be greater than any integer type.
Returning double is the only way to ensure that the result type has a range that can handle all possible input.
ceil() takes a double as an argument. So, if it were to return an integer, what integer type would you choose that can still represent its ceiled value?
Whatever may be the type, it should be able to represent all possible double values.
The integer type that can hold the highest possible value is uintmax_t.
But that doesn't guarantee it can hold all double values even in some implementations it can.
So, it makes sense to return a double value for ceil(). If an integer value is needed, then the caller can always cast it to the desired integer type.
OP starts with two integers a,b and questions why a function double ceil(double) that takes a double, does not return some integer type.
Most floating-point math functions take floating point arguments and return the same type.
A big reason double ceil(double) does not return an integer type is because that limited functionality is rarely needed. Integer types have (or almost always have) a more limited range that double. ceil(DBL_MAX) is not expected to fit in an integer type.
There is little need to use double math to solve an integer problem.
If code needs to divide integers and round up the quotient, use the following. Ref:#mch
i_quotient = (a + b - 1) / b;
The above will handle most of OP's cases when a >= 0 and b > 0. Other considerations are needed when a or b are negative or if a + b - 1 may overflow.
Because why should it? Converting betwen int and double takes time. This overhead can become significant. If you want to convert a double to int do so explicitly:
i_quotient = (int)ceil((double) a / b);
Check this answer if you want to know more about this latency. You have to consider that C is quit old and achievable performance was one of the top priorities. But even C# and other modern languages usually return a floating value for ceil just for consistency.
Leaving technical discussions apart, couldn't be simply for consistency?
If the function takes a double it should return a result of the same type, if there's no particular reasons to return a different type.
It's up to the user to transform it to an integer if he needs to.
After all you may be working only with doubles in your application.
Although ceil means to round up to the next whole number , it doesn't mean strictly that it is an integer, it's obvious that an integer is a whole number but that doesn't have to prejudice our mind.

When a double with an integer value is cast to an integer, is it guaranteed to do it 'properly'?

When a double has an 'exact' integer value, like so:
double x = 1.0;
double y = 123123;
double z = -4.000000;
Is it guaranteed that it will round properly to 1, 123123, and -4 when cast to an integer type via (int)x, (int)y, (int)z? (And not truncate to 0, 123122 or -5 b/c of floating point weirdness). I ask b/c according to this page (which is about fp's in lua, a language that only has doubles as its numeric type by default), talks about how integer operations with doubles are exact according to IEEE 754, but I'm not sure if, when calling C-functions with integer type parameters, I need to worry about rounding doubles manually, or it is taken care of when the doubles have exact integer values.
Yes, if the integer value fits in an int.
A double could represent integer values that are out of range for your int type. For example, 123123.0 cannot be converted to an int if your int type has only 16 bits.
It's also not guaranteed that a double can represent every value a particular type can represent. IEEE 754 uses something like 52 or 53 bits for the mantissa. If your long has 64 bits, then converting a very large long to double and back might not give the same value.
As Daniel Fischer stated, if the value of the integer part of the double (in your case, the double exactly) is representable in the type you are converting to, the result is exact. If the value is out of range of the destination type, the behavior is undefined. “Undefined” means the standard allows any behavior: You might get the closest representable number, you might get zero, you might get an exception, or the computer might explode. (Note: While the C standard permits your computer to explode, or even to destroy the universe, it is likely the manufacturer’s specifications impose a stricter limit on the behavior.)
It will do it correctly, if it really is true integer, which you might be assured of in some contexts. But if the value is the result of previous floating point calculations, you could not easily know that.
Why not explicitly calculate the value with the floor() function, as in long value = floor(x + 0.5). Or, even better, use the modf() function to inspect for an integer value.
Yes it will hold the exact value you give it because you input it in code. Sometimes in calculations it would yield 0.99999999999 for example but that is due to the error in calculating with doubles not its storing capacity

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work?
double d = floor(3.0 + 0.5);
int x = (int) d;
assert(x == 3);
My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like 2.99999, and x ends up being 2.
For the answer to this question to be yes, all integers within the range of an int have to be exactly representable as doubles, and floor must always return that exactly represented value.
All integers can have exact floating point representation if your floating point type supports the required mantissa bits. Since double uses 53 bits for mantissa, it can store all 32-bit ints exactly. After all, you could just set the value as mantissa with zero exponent.
If the result of floor() isn't exactly representable, what do you expect the value of d to be? Surely if you've got the representation of a floating point number in a variable, then by definition it's exactly representable isn't it? You've got the representation in d...
(In addition, Mehrdad's answer is correct for 32 bit ints. In a compiler with a 64 bit double and a 64 bit int, you've got more problems of course...)
EDIT: Perhaps you meant "the theoretical result of floor(), i.e. the largest integer value less than or equal to the argument, may not be representable as an int". That's certainly true. Simple way of showing this for a system where int is 32 bits:
int max = 0x7fffffff;
double number = max;
number += 10.0;
double f = floor(number);
int oops = (int) f;
I can't remember offhand what C does when conversions from floating point to integer overflow... but it's going to happen here.
EDIT: There are other interesting situations to consider too. Here's some C# code and results - I'd imagine at least similar things would happen in C. In C#, double is defined to be 64 bits and so is long.
using System;
class Test
{
static void Main()
{
FloorSameInteger(long.MaxValue/2);
FloorSameInteger(long.MaxValue-2);
}
static void FloorSameInteger(long original)
{
double convertedToDouble = original;
double flooredToDouble = Math.Floor(convertedToDouble);
long flooredToLong = (long) flooredToDouble;
Console.WriteLine("Original value: {0}", original);
Console.WriteLine("Converted to double: {0}", convertedToDouble);
Console.WriteLine("Floored (as double): {0}", flooredToDouble);
Console.WriteLine("Converted back to long: {0}", flooredToLong);
Console.WriteLine();
}
}
Results:
Original value: 4611686018427387903
Converted to double:
4.61168601842739E+18Floored (as double): 4.61168601842739E+18
Converted back to long:
4611686018427387904
Original value: 9223372036854775805
Converted to double:
9.22337203685478E+18Floored (as double): 9.22337203685478E+18
Converted back to long:
-9223372036854775808
In other words:
(long) floor((double) original)
isn't always the same as original. This shouldn't come as any surprise - there are more long values than doubles (given the NaN values) and plenty of doubles aren't integers, so we can't expect every long to be exactly representable. However, all 32 bit integers are representable as doubles.
I think you're a bit confused about what you want to ask. floor(3 + 0.5) is not a very good example, because 3, 0.5, and their sum are all exactly representable in any real-world floating point format. floor(0.1 + 0.9) would be a better example, and the real question here is not whether the result of floor is exactly representable, but whether inexactness of the numbers prior to calling floor will result in a return value different from what you would expect, had all numbers been exact. In this case, I believe the answer is yes, but it depends a lot on your particular numbers.
I invite others to criticize this approach if it's bad, but one possible workaround might be to multiply your number by (1.0+0x1p-52) or something similar prior to calling floor (perhaps using nextafter would be better). This could compensate for cases where an error in the last binary place of the number causes it to fall just below rather than exactly on an integer value, but it will not account for errors which have accumulated over a number of operations. If you need that level of numeric stability/exactness, you need to either do some deep analysis or use an arbitrary-precision or exact-math library which can handle your numbers correctly.

Resources