storing a big float into an integer (cast and no cast) - c

What does the standard (are there differences in the standards?) say about assigning a float number out of the range of an integer to this integer?
So what should happen here,
assuming 16 bit short, to keep the number small (USHRT_MAX == 65535)
float f = 100000.0f;
short s = f;
s = (short) f;
unsigned short us = f;
us = (unsigned short) f;

This is undefined behaviour (with no diagnostic required). See C11 6.3.1.4 (earlier standards had similar text):
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
So, assuming your system has USHRT_MAX as 65535, short s = f; and all subsequent lines cause undefined behaviour.
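One way to avoid the undefined behaviour is to test the value before converting. The following is only a sketch: float_to_short_checked is a hypothetical helper name, and the bounds are assumed to be exactly representable as float, which holds for a 16-bit short.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores the truncated value in *out and returns true
   only when the conversion is defined. NaN fails both comparisons. */
bool float_to_short_checked(float f, short *out)
{
    /* Assumes SHRT_MIN - 1 and SHRT_MAX + 1 are exactly representable as
       float, which holds for a 16-bit short. */
    if (f > SHRT_MIN - 1.0f && f < SHRT_MAX + 1.0f) {
        *out = (short)f;
        return true;
    }
    return false;
}
```

With a 16-bit short, passing 100000.0f makes the helper return false instead of performing the out-of-range conversion.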

Related

Guarantees of data preservation between int32_t and float?

I'm reading through the supplied code for a demo board (specifically the DC21561A) and found this snippet of code:
int32_t min_current_threshold_code;
min_current_threshold_code = (min_current_threshold / LTC2946_DELTA_SENSE_lsb) * resistor;
ack |= LTC2946_write_16_bits(LTC2946_I2C_ADDRESS, LTC2946_MIN_DELTA_SENSE_THRESHOLD_MSB_REG, (min_current_threshold_code << 4));
Here, everything on the RHS of the first assignment is a float. From what I can tell, and have tested, in an assignment with an int32_t on the LHS and a float on the RHS, the fractional part of the float is discarded and only the integer part is left; i.e. '1.5 * 3.5 = 5'.
The data above is written to a register over I2C. I assume the floats are used to give a more accurate estimate of threshold values. However, I was wondering whether this truncation when assigning a float to an int32_t is required by the C (or C++) standard or is something compiler-specific.
Edit: Some people have asked for more code. While my question is answered, here's the rest for thoroughness.
At the top of the file there is
float min_current_threshold = read_float();
const float LTC2946_DELTA_SENSE_lsb = 2.5006105E-05;
const float resistor = .02;
Yes, the standard mandates the truncation, for example in C99 chapter 6.3.1.4 paragraph 1:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero).
6.3.1.4 states:
When a finite value of real floating type is converted to an integer
type other than _Bool, the fractional part is discarded (i.e., the
value is truncated toward zero). If the value of the integral part
cannot be represented by the integer type, the behavior is undefined.
61)
And then there is a non-normative foot note explaining the above text:
61) The remaindering operation performed when a value of integer type
is converted to unsigned type need not be performed when a value of
real floating type is converted to unsigned type. Thus, the range of
portable real floating values is (−1, Utype_MAX+1).
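The truncation rule quoted above can be seen in a small sketch. product_to_code is a hypothetical helper, not part of the board's code; it only mirrors the "multiply in float, assign to int32_t" pattern from the question.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the snippet: multiply in float, then let the
   conversion to int32_t discard the fractional part. The caller must keep the
   product within the range of int32_t, otherwise the behaviour is undefined. */
int32_t product_to_code(float a, float b)
{
    return (int32_t)(a * b);
}
```

Here product_to_code(1.5f, 3.5f) yields 5 (from 5.25) and product_to_code(-1.5f, 3.5f) yields -5 (from -5.25), since truncation is toward zero rather than toward negative infinity.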

cast float to unsigned int in C with gcc

I am using gcc to test some simple casts between float to unsigned int.
The following piece of code gives the result 0.
const float maxFloat = 4294967295.0;
unsigned int a = (unsigned int) maxFloat;
printf("%u\n", a);
0 is printed (which I believe is very strange).
On the other hand the following piece of code:
const float maxFloat = 4294967295.0;
unsigned int a = (unsigned int) (signed int) maxFloat;
printf("%u\n", a);
prints 2147483648, which I believe is the correct result.
What happens that I get 2 different results?
If you first do this:
printf("%f\n", maxFloat);
The output you'll get is this:
4294967296.000000
Assuming a float is implemented as an IEEE754 single precision floating point type, the value 4294967295.0 cannot be represented exactly by this type because there aren't enough bits of precision. The closest value it can store is 4294967296.0.
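The rounding can be checked without printf by narrowing a double to float and widening it back. This is a sketch that assumes IEEE 754 single precision with the default round-to-nearest mode; nearest_float_value is a hypothetical name.

```c
/* Rounds a double to the nearest float and widens it again, so the result
   shows which value a float actually stores. Assumes IEEE 754 binary32 and
   round-to-nearest; d must be within the range of float. */
double nearest_float_value(double d)
{
    return (double)(float)d;
}
```

Under those assumptions nearest_float_value(4294967295.0) compares equal to 4294967296.0.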
Assuming an int (and likewise unsigned int) is 32 bits, the value 4294967296.0 is outside the range of both of these types. Converting a floating point type to an integer type when the value cannot be represented in the given integer type invokes undefined behavior.
This is detailed in section 6.3.1.4 of the C standard which dictates conversion from floating point types to integer types:
1 When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e.,
the value is truncated toward zero). If the value of the integral part
cannot be represented by the integer type, the behavior is undefined.61)
...
61) The remaindering operation performed when a value of integer type
is converted to unsigned type need not be performed when a value of
real floating type is converted to unsigned type. Thus, the range of
portable real floating values is (−1, Utype_MAX+1).
The footnote in the above passage is referencing section 6.3.1.3, which details integer to integer conversions:
1 When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new
type, it is unchanged.
2 Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that
can be represented in the new type until the value is in the range of
the new type.
3 Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an
implementation-defined signal is raised.
The behavior you see in the first code snippet is consistent with an out-of-range conversion to an unsigned type when the value in question is an integer; however, because the value being converted has a floating point type, it is undefined behavior.
Just because one implementation does this doesn't mean that all will. In fact, gcc gives a different result if you change the optimization settings.
For example, on my machine using gcc 5.4.0, given this code:
float n = 4294967296;
printf("n=%f\n", n);
unsigned int a = (unsigned int) n;
int b = (signed int) n;
unsigned int c = (unsigned int) (signed int) n;
printf("a=%u\n", a);
printf("b=%d\n", b);
printf("c=%u\n", c);
I get the following results with -O0:
n=4294967296.000000
a=0
b=-2147483648
c=2147483648
And this with -O1:
n=4294967296.000000
a=4294967295
b=2147483647
c=2147483647
If on the other hand n is defined as long or long long, you would always get this output:
n=4294967296
a=0
b=0
c=0
The conversion to unsigned is well defined by the C standard as cited above, and the conversion to signed is implementation-defined, which gcc defines as follows:
The result of, or the signal raised by, converting an integer to a signed integer type when the value cannot be represented in an object
of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).
For conversion to a type of width N, the value is reduced modulo 2^N
to be within range of the type; no signal is raised.
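For contrast with the floating point case, the integer-to-unsigned conversion can be sketched as below. wrap_to_unsigned is a hypothetical name, and the values in the note assume a 32-bit unsigned int.

```c
/* Integer-to-unsigned conversion is defined by 6.3.1.3p2: the value is
   reduced modulo UINT_MAX + 1, regardless of optimization level. */
unsigned int wrap_to_unsigned(long long n)
{
    return (unsigned int)n;
}
```

With a 32-bit unsigned int, wrap_to_unsigned(4294967296LL) is 0 and wrap_to_unsigned(-1LL) is 4294967295, which matches the a=0 output shown for the integer case.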
Assuming IEEE 754 floating point numbers, the number 4294967295.0 can't be stored exactly in a float. It will be stored as 4294967296.0 instead (which is 2^32).
Further assuming your unsigned int has 32 value bits, this is just by one too large to fit in an unsigned int, so the result of the conversion is undefined according to the C standard -- 0 is a "reasonable" outcome.
In your second case, you have undefined behavior as well, and I have no theory what's happening here on the representation level. Fact is, the number is much too large for a 32 bit signed int (still assuming this is what your machine uses).
From this remark in your question:
prints 2147483648, which I believe is the correct result.
I assume you wanted to see the representation of your float in memory. Casting will convert the value, so that's not the way to see the representation. The following code would do:
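#include <stdio.h>
int main(void) {
const float maxFloat = 4294967295.0;
// View the object representation through an unsigned char pointer
const unsigned char *floatBytes = (const unsigned char *)&maxFloat;
for (size_t i = 0; i < sizeof maxFloat; ++i)
{
printf("0x%02x ", floatBytes[i]);
}
puts("");
}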
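An alternative sketch copies the bytes into a fixed-width integer with memcpy, which gives the bit pattern as a single number. float_bits is a hypothetical name, and the code assumes that float is a 4-byte IEEE 754 binary32 value.

```c
#include <stdint.h>
#include <string.h>

/* Copies the object representation of a float into a 32-bit integer.
   Assumes sizeof(float) == sizeof(uint32_t) and IEEE 754 binary32. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Under those assumptions float_bits(4294967296.0f) is 0x4F800000: sign 0, biased exponent 159 (32 + 127), and an all-zero fraction.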

What happens when casting floating point types to unsigned integer types when the value would overflow?

I'm wondering what happens when casting from a floating point type to an unsigned integer type in C when the value can't be accurately represented by the integer type in question. Take for instance
void func(void)
{
float a = 1E10;
unsigned b = a;
}
The value of b I get on my system (with unsigned on my system being able to represent values from 0 to 2^32-1) is 1410065408. This seems sensible to me because it's simply the lowest order bits of the result of the cast.
I believe the behavior of operations such as these is undefined by the standard. Am I wrong? What can I expect in practice if I do things like this?
Also, what happens with signed types? If b is of type int, I get -2147483648, which doesn't really make sense to me.
What happens when casting floating point types to unsigned integer types when the value would overflow (?)
undefined behavior (UB)
In addition to @user694733's fine answer: to prevent undefined behavior caused by an out-of-range float to unsigned conversion, code can first test the float value.
Yet testing for the range is tricky, for unsigned types and especially for signed types. The detail is that all conversions and constants prior to the integer conversion must be exact. FP math near the limits needs to be exact too.
Examples:
Conversion to a 32-bit unsigned is valid for the range -0.999... to 4294967295.999....
Conversion to a 32-bit 2's complement signed is valid for the range -2147483648.999... to 2147483647.999....
#include <limits.h>
#include <math.h>
#include <stdbool.h>
// Code uses FP constants that are exact powers of 2 to ensure their exact encoding.
// Form a FP constant that is exactly UINT_MAX + 1
#define FLT_UINT_MAX_P1 ((UINT_MAX/2 + 1)*2.0f)
bool convert_float_to_unsigned(unsigned *u, float f) {
if (f > -1.0f && f < FLT_UINT_MAX_P1) {
*u = (unsigned) f;
return true;
}
return false; // out of range
}
#define FLT_INT_MAX_P1 ((INT_MAX/2 + 1)*2.0f)
bool convert_float_to_int(int *i, float f) {
#if INT_MIN == -INT_MAX
// Rare non 2's complement integer
if (fabsf(f) < FLT_INT_MAX_P1) {
*i = (int) f;
return true;
}
#else
// Do not use f + 1 > INT_MIN as it may incur rounding
// Do not use f > INT_MIN - 1.0f as it may incur rounding
// f - INT_MIN is expected to be exact for values near the limit
if (f - INT_MIN > -1 && f < FLT_INT_MAX_P1) {
*i = (int) f;
return true;
}
#endif
return false; // out of range
}
Pedantic code would take additional steps to cope with the rare FLT_RADIX 10.
FLT_EVAL_METHOD, which allows float math to be calculated at higher precision, may play a role, yet so far I do not see it negatively affecting the above solution.
In both cases the value is out of range, so it's undefined behaviour.
6.3.1.4 Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool,
the fractional part is discarded (i.e., the value is truncated toward zero). If the value of
the integral part cannot be represented by the integer type, the behavior is undefined. 61)
61) The remaindering operation performed when a value of integer type is converted to unsigned type
need not be performed when a value of real floating type is converted to unsigned type. Thus, the
range of portable real floating values is (−1, Utype_MAX+1).
To make this well-defined code, you should check that the value is within the representable range before doing the conversion.
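If the "lowest order bits" result from the question is what is actually wanted, one hedged way to get it with defined behaviour is to convert to a wider integer type first and let the integer-to-unsigned conversion do the wrapping. float_to_unsigned_wrapped is a hypothetical helper, and the sketch assumes a 64-bit long long.

```c
#include <stdbool.h>

/* Hypothetical helper: converts f to long long first (defined when f is in
   range of long long), then lets the integer-to-unsigned conversion wrap
   modulo UINT_MAX + 1. Returns false for NaN and out-of-range values. */
bool float_to_unsigned_wrapped(float f, unsigned *out)
{
    /* 9223372036854775808.0f is 2^63, exactly representable as a float.
       Assumes long long is 64 bits wide. */
    const float two63 = 9223372036854775808.0f;
    if (f >= -two63 && f < two63) {
        *out = (unsigned)(long long)f;
        return true;
    }
    return false;
}
```

With a 32-bit unsigned, passing 1E10f stores 1410065408 (10000000000 modulo 2^32), the same value the question observed, but now guaranteed rather than accidental.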

Problems casting a double into an unsigned char

Why does casting a double 728.3 to an unsigned char produce zero? 728 is 0x2D8, so shouldn't w be 0xD8 (216)?
int w = (unsigned char)728.3;
int x = (int)728.3;
int y = (int)(unsigned char)728.3;
int z = (unsigned char)(int)728.3;
printf( "%i %i %i %i", w, x, y, z );
// prints 0 728 0 216
From the C standard 6.3.1.4p1:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
So, unless you have >=10 bit unsigned char, your code invokes undefined behaviour.
Note that the cast explicitly tells the compiler you know what you are doing, thus suppresses a warning.
Supposing that unsigned char has 8 value bits, as is nearly (but not completely) certain for your implementation, the behavior of converting the double value 728.3 to type unsigned char is undefined, as specified by paragraph 6.3.1.4/1 of the standard:
When a finite value of real floating type is converted to an integer
type other than _Bool, the fractional part is discarded (i.e., the
value is truncated toward zero). If the value of the integral part
cannot be represented by the integer type, the behavior is undefined.
This applies to both your w and your y. It does not apply to your x, and the rules covering conversions between integer values (i.e. your z) are different.
Basically, then, there is no answer at the C level for why you see the specific results you do, nor for why I see different ones when I run your code. The behavior is undefined; I can be thankful that it did not turn out to be an outpouring of nasal demons.
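The difference between the direct conversion and the conversion through int can be sketched with two hypothetical helpers. The values in the note assume an 8-bit unsigned char.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: direct conversion, refused when the truncated value
   does not fit in unsigned char (also returns false for NaN). */
bool double_to_uchar_checked(double d, unsigned char *out)
{
    if (d > -1.0 && d < UCHAR_MAX + 1.0) {
        *out = (unsigned char)d;
        return true;
    }
    return false;
}

/* Hypothetical helper: wraps modulo UCHAR_MAX + 1 by going through int.
   The caller must ensure d is within the range of int. */
unsigned char double_to_uchar_wrapped(double d)
{
    return (unsigned char)(int)d;
}
```

With an 8-bit unsigned char, double_to_uchar_checked(728.3, &c) returns false, while double_to_uchar_wrapped(728.3) returns 216 (0xD8), the value the question expected for w.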

adding and subtracting float from unsigned short in C

I ran into a problem and it is driving me nuts.
I have a code like this
float a;
unsigned short b;
b += a;
When a is negative, b is going bananas.
I even did a cast
b += (unsigned short) a;
but it doesn't work.
What did I do wrong? How can I add a float to an unsigned short?
FYI:
When 'a' is -1 and b is 0 then I'll see 'b +=a' will give b = 65535.
The way to add a float to an unsigned short is simply to add it, exactly as you've done. The operands of the addition will undergo conversions, as I'll describe below.
A simple example, based on your code, is:
#include <stdio.h>
int main(void) {
float a = 7.5;
unsigned short b = 42;
b += a;
printf("b = %hu\n", b);
return 0;
}
The output, unsurprisingly, is:
b = 49
The statement
b += a;
is equivalent to:
b = b + a;
(except that b is only evaluated once). When operands of different types are added (or subtracted, or ...), they're converted to a common type based on a set of rules you can find in the C standard section 6.3.1.8. In this case, b is converted from unsigned short to float. The addition is equivalent to 42.0f + 7.5f, which yields 49.5f. The assignment then converts this result from float to unsigned short, and the result, 49, is stored in b.
If the mathematical result of the addition is outside the range of float (which is unlikely), or if it's outside the range of unsigned short (which is much more likely), then the program will have undefined behavior. You might see some garbage value stored in b, your program might crash, or in principle quite literally anything else could happen. When you convert a signed or unsigned integer to an unsigned integer type, the result is wrapped around; this does not happen when converting a floating-point value to an unsigned type.
Without more information, it's impossible to tell what problem you're actually having or how to fix it.
But it does seem that adding an unsigned short and a float and storing the result in an unsigned short is an unusual thing to do. There could be situations where it's exactly what you need (if so you need to avoid overflow), but it's possible that you'd be better off storing the result in something other than an unsigned short, perhaps in a float or double. (Incidentally, double is used more often than float for floating-point data; float is useful mostly for saving space when you have a lot of data.)
If you're doing numeric conversions, even implicit ones, it's often (but by no means always) an indication that you should have used a variable of a different type in the first place.
Your question would be improved by showing actual values you have trouble with, and explaining what value you expected to get.
But in the meantime, the definition of floating to integer conversion in C11 6.3.1.4/1 is:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
This comes into play at the point where the result of b + a, which is a float, is assigned back to b. Recall that b += a is equivalent to b = b + a.
If b + a is a negative number of -1 or greater magnitude, then its integral part is out of range for unsigned short so the code causes undefined behaviour which means anything can happen; including but not limited to going bananas.
A footnote repeats the point that the float is not first converted to a signed integer and then to unsigned short:
The remaindering operation performed when a value of integer type is converted to unsigned type need not be performed when a value of real floating type is converted to unsigned type. Thus, the range of portable real floating values is (−1, Utype_MAX+1)
As an improvement you could write:
b += (long long)a;
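which will at least not cause UB so long as the value of a is within the range of long long; the integer sum is then reduced modulo USHRT_MAX + 1 when it is stored in b.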
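If wrap-around is not wanted either, the sum can be range-checked before it is stored. This is a minimal sketch, assuming a 16-bit unsigned short; add_float_to_ushort is a hypothetical helper name.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: adds a to *b only when the truncated sum fits in
   unsigned short; otherwise leaves *b unchanged and returns false.
   NaN fails both comparisons and is rejected as well. */
bool add_float_to_ushort(unsigned short *b, float a)
{
    /* Computed in double; assumes unsigned short is 16 bits wide. */
    double sum = (double)*b + (double)a;
    if (sum > -1.0 && sum < USHRT_MAX + 1.0) {
        *b = (unsigned short)sum;
        return true;
    }
    return false;
}
```

With b equal to 0 and a equal to -1, the helper reports failure instead of leaving 65535 (or anything else) in b, so the caller decides how to handle the negative result.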
You want b to be positive (it is unsigned), but a can be negative. That is OK as long as the magnitude of a is not larger than b. This is the first point.
Second, when you cast a negative value to an unsigned type, what is the result actually supposed to be? The sign of a number is stored in the most significant bit, and for negative values it is 1. When the value is treated as unsigned and the most significant bit is 1, the value is very large and has nothing in common with the negative one.
Maybe try b -= fabs(a) for negative a. Isn't that what you are looking for?
You are observing the combination of the float being converted to an integer, and unsigned integer wrap-around ( https://stackoverflow.com/a/9052112/1149664 ).
Consider
b += a
for example, with a = -100.67 you add a negative value to an unsigned data type, and depending on the initial value of b the result ought to be negative. How did you come to use an unsigned short and not just float or double for this task?
