A long double is known to use 80 bits.
2^80 = 1208925819614629174706176;
Why, when declaring a variable such as:
long double a = 1208925819614629174706175; // 2^80 - 1
I get a warning saying: Integer constant is too large for its type.
1208925819614629174706175 is an integer literal, not a double. Your program would happily convert it, but it would have to be a valid integer first. Instead, use a long double literal: 1208925819614629174706175.0L.
Firstly, it is not known how many bits a long double type is using. It depends on the implementation.
Secondly, just because some floating-point type uses some specific number of bits it does not mean that this type can precisely represent an integer value using all these bits (if that's what you want). Floating-point types are called floating-point types because they represent non-integer values, which normally implies a non-trivial internal representation. Due to specifics of that representation, only a portion of these bits can be used for the actual digits of the number. This means that your 2^80 - 1 number will get truncated/rounded in one way or another. So, regardless of how you do it, don't be surprised if the compiler warns you about the data loss.
Thirdly, as other answers have already noted, the constant you are using in the text of your program is an integral constant. The limitations imposed on that constant have nothing to do with floating-point types at all. Use a floating-point constant instead of an integral one.
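The value 1208925819614629174706175 is first created as an integer constant, and only then converted to a long double when the variable is initialized. The warning is about that integer constant, which does not fit in any integer type the compiler provides.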
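To make the two points above concrete (the suffix, and the limited number of significand bits), here is a minimal sketch. The function names are made up for this example. It assumes FLT_RADIX is 2 and that the compiler rounds decimal constants to the nearest representable value, which is what mainstream compilers do but not something the standard strictly requires.

```c
#include <float.h>

/* Significand (mantissa) bits of long double on this implementation:
   typically 64 for the x87 format, 53 where long double is the same
   as double, 113 for IEEE binary128. */
int long_double_significand_bits(void)
{
    return LDBL_MANT_DIG;
}

/* The L suffix makes this a long double constant, so there is no
   oversized integer constant for the compiler to complain about. */
long double two_pow_80_minus_1(void)
{
    return 1208925819614629174706175.0L;
}

/* 2^80 built by repeated doubling, which is exact in binary floating point. */
long double two_pow_80(void)
{
    long double v = 1.0L;
    for (int i = 0; i < 80; ++i)
        v *= 2.0L;
    return v;
}
```

When long_double_significand_bits() is below 80, two_pow_80_minus_1() == two_pow_80() is expected to hold: the constant compiles without a warning, but it has been rounded to 2^80.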
In embedded C, I just input a value of 500000 into a 16-bit slot. It's giving me the warning "Integer conversion resulted in truncation". In this event, does that mean the value is set to 65535, 41248 (which is the remainder of 500000/65536), or another value? Or is there not enough information given here to determine its value, and are there other factors at play? Please let me know if more info is needed.
(Sample code, in case it helps)
TA0CCR0 = 500000-1;
TA0CCTL1 = OUTMOD_7;
TA0CCR1 = 250000;
TA0CCTL2 = OUTMOD_7;
TA0CCR2 = 850;
TA0CTL = TASSEL__SMCLK | MC__UP | TACLR;
It depends on whether the variable is signed or unsigned.
Please see C18 §6.3.1.3
Signed and unsigned integers
1 When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
2 Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
3 Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
The C language was designed to be "strongly typed, weakly checked" (direct quote from Dennis Ritchie, one of its authors), meaning that even though all type validation is done at compile time, the compiler will usually pick the path of least resistance when generating machine code. On most architectures, using a 16-bit type simply means that it will use 16-bit load and store instructions for that variable, which will automatically make its value mod(65536). So while the compiler notices that you're trying to put a value that's larger than (2^16)-1 into a 16-bit integer, it also won't really do anything about it. So yes, your variable will contain the value 41248.
does that mean the value is set to 65535, 41248 (which is the remainder of 500000/65536), or another value?
With unsigned types, the value is wrapped (mod 65536).
With signed values, it is implementation defined by the compiler. Commonly, it is also wrapped. Robust portable code does not assume this.
Recommendation: Use unsigned math and types to achieve specified consistent results.
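A small sketch of the unsigned case, assuming the destination is a 16-bit unsigned object (the question does not show how the registers are declared, so that part is an assumption, and to_u16 is a name invented here):

```c
#include <stdint.h>

/* Conversion to an unsigned type is fully defined by the standard:
   the result is the value modulo 2^16 (C18 6.3.1.3 paragraph 2). */
uint16_t to_u16(unsigned long value)
{
    return (uint16_t)value;
}
```

Under that assumption, to_u16(500000UL) gives 41248, and the 500000-1 from the sample code would end up as 41247.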
I've just implemented a line of code, where two numbers need to be divided and the result needs to be rounded up to the next integer number. I started very naïvely:
i_quotient = ceil(a/b);
As the numbers a and b are both integer numbers, this did not work: the division gets executed as an integer division, which is rounding down by default, so I need to force the division to be a floating point operation:
i_quotient = ceil((double) a / b);
Now this seems to work, but it leaves a warning saying that I am trying to assign a double to an integer, and indeed, following the header file "math.h" the return type of the ceil() function is "double", and now I'm lost: what's the sense of a rounding function to return a double? Can anybody enlighten me about this?
A double has a range that can be greater than any integer type.
Returning double is the only way to ensure that the result type has a range that can handle all possible input.
ceil() takes a double as an argument. So, if it were to return an integer, what integer type would you choose that can still represent its ceiled value?
Whatever may be the type, it should be able to represent all possible double values.
The integer type that can hold the highest possible value is uintmax_t.
But that doesn't guarantee it can hold all double values, even if in some implementations it can.
So, it makes sense to return a double value for ceil(). If an integer value is needed, then the caller can always cast it to the desired integer type.
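A quick way to see the range argument on your own platform is the sketch below. The function name is hypothetical, and the result is what I would expect on typical implementations (IEEE 754 double, 64-bit uintmax_t), not something the standard promises.

```c
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the largest finite double lies beyond the largest value of
   the widest unsigned integer type, i.e. when no integer type could
   hold ceil(DBL_MAX). */
bool double_outranges_uintmax(void)
{
    return DBL_MAX > (double)UINTMAX_MAX;
}
```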
OP starts with two integers a,b and questions why a function double ceil(double) that takes a double, does not return some integer type.
Most floating-point math functions take floating point arguments and return the same type.
A big reason double ceil(double) does not return an integer type is because that limited functionality is rarely needed. Integer types have (or almost always have) a more limited range than double. ceil(DBL_MAX) is not expected to fit in an integer type.
There is little need to use double math to solve an integer problem.
If code needs to divide integers and round up the quotient, use the following. Ref:#mch
i_quotient = (a + b - 1) / b;
The above will handle most of OP's cases when a >= 0 and b > 0. Other considerations are needed when a or b are negative or if a + b - 1 may overflow.
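Wrapped up as functions, that could look like the sketch below. The names are invented here; the second variant is an alternative I am adding that avoids forming a + b - 1, so it cannot overflow for the same inputs.

```c
#include <assert.h>

/* Ceiling of a / b using the formula above.
   Requires a >= 0, b > 0 and a + b - 1 <= INT_MAX. */
int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Same result for a >= 0 and b > 0, without the intermediate sum:
   truncating division, plus one if there was a remainder. */
int ceil_div_no_overflow(int a, int b)
{
    assert(a >= 0 && b > 0);
    return a / b + (a % b != 0);
}
```

For example, ceil_div(7, 2) is 4 and ceil_div(8, 2) is also 4.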
Because why should it? Converting between int and double takes time. This overhead can become significant. If you want to convert a double to int do so explicitly:
i_quotient = (int)ceil((double) a / b);
Check this answer if you want to know more about this latency. You have to consider that C is quite old and achievable performance was one of the top priorities. But even C# and other modern languages usually return a floating value for ceil just for consistency.
Leaving technical discussions aside, couldn't it simply be for consistency?
If the function takes a double it should return a result of the same type, if there's no particular reasons to return a different type.
It's up to the user to transform it to an integer if he needs to.
After all you may be working only with doubles in your application.
Although ceil means to round up to the next whole number, that doesn't strictly mean the result is an integer type. It's obvious that an integer is a whole number, but that shouldn't prejudice our thinking.
Going on understanding of these datatypes as primitives
(int) char, and (char) int are interpretations of data. (int) c gives the integer value of that character, and (char) 14 gives you back the character encoded by 14.
I've always understood this as being a "memory parse", such that it just takes the value at that position and then applies a type filter to it.
Given that floating points are stored as some version of scientific notation, what is stored in memory should be garbage as an integer. Looking into this utility http://www.h-schmidt.net/FloatConverter/IEEE754.html it appears that the whole number portion is separated.
However, since this is in the higher portion of memory, how does the int cast know to "reformat"? Does the compiler identify that it was a float and apply special handling, or what's going on?
Your understanding of casts is completely wrong. Casts are nothing but explicit requests for a value conversion from one type to another. They do not reinterpret the representation of one type as if it had a different type. The source code:
float f = 42.5;
int x;
x = (int)f;
simply instructs the compiler to produce code that truncates the floating point value of the expression f to an integer and store the result in the object x.
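To contrast the two ideas, here is a sketch with made-up function names. The second function does what the question imagined a cast does: it copies the bytes and reads them as an integer. It assumes float is a 32-bit IEEE 754 binary32, which is common but not guaranteed.

```c
#include <stdint.h>
#include <string.h>

/* A cast converts the value: 42.5f becomes 42. */
int convert_value(float f)
{
    return (int)f;
}

/* Copying the bytes reinterprets the representation instead.
   Assumption: float and uint32_t have the same size. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float is not 32 bits");

uint32_t reinterpret_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On such a platform, convert_value(42.5f) is 42, while reinterpret_bits(42.5f) is 0x422A0000, the "garbage as an integer" the question expected.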
I've always understood this as being a "memory parse", such that it just takes the value at that position and then applies a type filter to it.
That is an incorrect understanding.
The language specifies conversions between the fundamental arithmetic types. Look up "Usual Arithmetic Conversions" on the web; you will find a lot of links that describe them. For converting a floating point type to an integral type, this is what the C99 Standard has to say:
6.3.1.4 Real floating and integer
1 When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
float f = 4.5;
int i = (int)f; // i is 4
f = -6.3;
i = (int)f; // i is -6
When a double has an 'exact' integer value, like so:
double x = 1.0;
double y = 123123;
double z = -4.000000;
Is it guaranteed that it will round properly to 1, 123123, and -4 when cast to an integer type via (int)x, (int)y, (int)z? (And not truncate to 0, 123122 or -5 b/c of floating point weirdness). I ask b/c according to this page (which is about fp's in lua, a language that only has doubles as its numeric type by default), talks about how integer operations with doubles are exact according to IEEE 754, but I'm not sure if, when calling C-functions with integer type parameters, I need to worry about rounding doubles manually, or it is taken care of when the doubles have exact integer values.
Yes, if the integer value fits in an int.
A double could represent integer values that are out of range for your int type. For example, 123123.0 cannot be converted to an int if your int type has only 16 bits.
It's also not guaranteed that a double can represent every value a particular integer type can represent. An IEEE 754 double has a 53-bit significand (52 bits stored explicitly). If your long has 64 bits, then converting a very large long to double and back might not give the same value.
As Daniel Fischer stated, if the value of the integer part of the double (in your case, the double exactly) is representable in the type you are converting to, the result is exact. If the value is out of range of the destination type, the behavior is undefined. “Undefined” means the standard allows any behavior: You might get the closest representable number, you might get zero, you might get an exception, or the computer might explode. (Note: While the C standard permits your computer to explode, or even to destroy the universe, it is likely the manufacturer’s specifications impose a stricter limit on the behavior.)
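If the value might be out of range, one way to stay clear of the undefined behavior is to check before converting. This is only a sketch: the function name is hypothetical, and the bounds are exact only when int has no more bits than the double's significand (true for the usual 32-bit int and IEEE 754 double).

```c
#include <limits.h>
#include <stdbool.h>

/* Stores the truncated value in *out and returns true when d can be
   converted to int; returns false otherwise. NaN fails both
   comparisons, so it is rejected as well. */
bool double_to_int_checked(double d, int *out)
{
    if (d > (double)INT_MIN - 1.0 && d < (double)INT_MAX + 1.0) {
        *out = (int)d;
        return true;
    }
    return false;
}
```

With this, 123123.0 and -4.0 convert to exactly 123123 and -4, while something like 1e30 is reported as not convertible instead of invoking undefined behavior.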
It will do it correctly if it really is a true integer, which you might be assured of in some contexts. But if the value is the result of previous floating point calculations, you could not easily know that.
Why not explicitly calculate the value with the floor() function, as in long value = floor(x + 0.5). Or, even better, use the modf() function to inspect for an integer value.
Yes, it will hold the exact value you give it, because you entered it in the code. Sometimes a calculation will yield 0.99999999999, for example, but that is due to the error in calculating with doubles, not their storage capacity.
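A classic case of a calculated value landing just below an integer is adding 0.1 ten times. The sketch below assumes IEEE 754 doubles evaluated in plain double precision; the function names are invented here, and the rounding helper is only meant for non-negative values that fit in a long.

```c
/* Adds 0.1 ten times. With IEEE 754 doubles the result is slightly
   below 1.0, because 0.1 has no exact binary representation. */
double sum_of_tenths(void)
{
    double sum = 0.0;
    for (int i = 0; i < 10; ++i)
        sum += 0.1;
    return sum;
}

/* Round-to-nearest for non-negative x, instead of truncation. */
long round_nonnegative(double x)
{
    return (long)(x + 0.5);
}
```

Under those assumptions (int)sum_of_tenths() truncates to 0, whereas round_nonnegative(sum_of_tenths()) gives 1. A value written directly in the source, such as 123123.0, has no such problem.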
Can I compare a floating-point number to an integer?
Will the float compare to integers in code?
float f; // f has a saved predetermined floating-point value to it
if (f >=100){__asm__reset...etc}
Also, could I...
float f;
int x = 100;
x+=f;
I have to use the floating point value f received from an attitude reference system to adjust a position value x that controls a PWM signal to correct for attitude.
The first one will work fine. 100 will be converted to a float, and IEEE 754 single precision can represent every integer exactly up to 2^24 (16,777,216).
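The second one will also work: x is converted to float, the addition is done in floating point, and the result is converted back to an integer when it is stored in x, so you'll lose the fractional part (that's unavoidable if you're turning floats into integers).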
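Here is a minimal sketch of both snippets from the question as functions. The names are made up, and the real attitude and PWM code is of course not shown in the question.

```c
#include <stdbool.h>

/* The comparison: 100 is converted to float before comparing. */
bool at_or_above_100(float f)
{
    return f >= 100;
}

/* The adjustment: x += f behaves like x = (int)((float)x + f),
   so the fractional part of the sum is dropped (truncation toward zero). */
int adjust_position(int x, float f)
{
    x += f;
    return x;
}
```

For example, adjust_position(100, 2.7f) returns 102 and adjust_position(100, -2.7f) returns 97.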
Since you've identified yourself as unfamiliar with the subtleties of floating point numbers, I'll refer you to this fine paper by David Goldberg: What Every Computer Scientist Should Know About Floating-Point Arithmetic (reprint at Sun).
After you've been scared by that, the reality is that most of the time floating point is a huge boon to getting calculations done. And modern compilers and languages (including C) handle conversions sensibly so that you don't have to worry about them. Unless you do.
The points raised about precision are certainly valid. An IEEE float effectively has only 24 bits of precision, which is less than a 32-bit integer. Use of double for intermediate calculations will push all rounding and precision loss out to the conversion back to float or int.
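The 24-bit limit is easy to observe with a round trip. This sketch uses invented function names and assumes float and double are IEEE 754 binary32 and binary64.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if i comes back unchanged after a trip through float.
   The result is widened to int64_t so that a float rounded up to
   2^31 can still be converted back without overflow. */
bool survives_float_round_trip(int32_t i)
{
    float f = (float)i;
    return (int64_t)f == i;
}

/* Same trip through double, which has 53 bits of precision. */
bool survives_double_round_trip(int32_t i)
{
    double d = (double)i;
    return (int64_t)d == i;
}
```

16777216 (2^24) survives the float round trip, 16777217 does not, and both survive the trip through double.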
Mixed-mode arithmetic (arithmetic between operands of different types and/or sizes) is legal but fragile. The C standard defines rules for type promotion in order to convert the operands to a common representation. Automatic type promotion allows the compiler to do something sensible for mixed-mode operations, but "sensible" does not necessarily mean "correct."
To really know whether or not the behavior is correct you must first understand the rules for promotion and then understand the representation of the data types. In very general terms:
shorter types are converted to longer types (float to double, short to int, etc.)
integer types are converted to floating-point types
signed/unsigned conversions favor avoiding data loss (whether signed is converted to unsigned or vice-versa depends on the size of the respective types)
Whether code like x > y (where x and y have different types) is right or wrong depends on the values that x and y can take. In my experience it's common practice to prohibit (via the coding standard) implicit type conversions. The programmer must consider the context and explicitly perform any type conversions necessary.
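A well-known case where "sensible" is not "correct" is comparing a negative int with an unsigned int. The sketch below uses hypothetical function names; the second function is one way to spell out the intended comparison explicitly.

```c
#include <stdbool.h>

/* Implicit conversion: a is converted to unsigned int before the
   comparison, so -1 becomes UINT_MAX. */
bool naive_less(int a, unsigned int b)
{
    return a < b;
}

/* Explicit handling: a negative value is always less than any
   unsigned value; otherwise the conversion is harmless. */
bool careful_less(int a, unsigned int b)
{
    return a < 0 || (unsigned int)a < b;
}
```

naive_less(-1, 1u) is false, while careful_less(-1, 1u) is true.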
Can you compare a float and an integer? Sure. But the problem you will run into is precision. On most C/C++ implementations, float and int have the same size (4 bytes) but wildly different precision. Neither type can hold all values of the other. Since neither type can be converted to the other without possible loss of precision, and the two types cannot be compared natively, a comparison that converts one operand to the other's type will lose precision in some scenarios.
What you can do to avoid precision loss is to convert both types to a type which has enough precision to represent all values of float and int. On most systems, double will do just that. So the following usually does a non-lossy comparison
float f = getSomeFloat();
int i = getSomeInt();
if ( (double)i == (double)f ) {
...
}
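To see the difference, compare the same pair of values both ways. This is a sketch with invented names, assuming a 32-bit int and IEEE 754 float and double.

```c
#include <stdbool.h>

/* Comparison carried out in float: i is rounded to float first. */
bool equal_via_float(int i, float f)
{
    return (float)i == f;
}

/* Comparison carried out in double: both conversions are exact
   under the stated assumptions. */
bool equal_via_double(int i, float f)
{
    return (double)i == (double)f;
}
```

With i = 16777217 and f = 16777216.0f, equal_via_float reports them equal because the int is rounded to the nearest float, while equal_via_double correctly reports a difference.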
In an assignment, the LHS defines the precision.
So if your LHS is int and RHS is float, then this results in loss of precision.
Also take a look at FP related CFAQ
Yes, you can compare them, you can do math on them without terribly much regard for which is which, in most cases. But only most. The big bugaboo is that you can check for f<i etc. but should not check for f==i. An integer and a float that 'should' be identical in value are not necessarily identical.
Yeah, it'll work fine. Specifically, the int will be converted to float for the purposes of the comparison. In the second one you'll need to cast to int but it should be fine otherwise.
Yes, and sometimes it'll do exactly what you expect.
As the others have pointed out, comparing, e.g., 1.0 == 1 will work out, because the integer 1 is converted to double (not float) before the comparison.
However, other comparisons may not.
About that: the constant 1.0 is of type double, so the comparison is made in double by the usual arithmetic conversions, as said before. 1.f or 1.0f is of type float, and the comparison would have been made in float. That would have worked as well, since every integer up to 2^24 is exactly representable in an IEEE 754 float.