I have a 32-bit unsigned value that I want to convert to IEEE 754 floating point.
Here are my variables:
unsigned int val = 0x3f800000;
float floatVal;
I know that I have to do the following to properly reinterpret the uint value as the IEEE 754 floating point value I expect, 1.0:
floatVal = *((float*)&val);
Can anyone explain why floatVal = (float)val does not give the same result? Instead of 1.0, floatVal gets a value of 1.0653532e+009. I believe it is partially because bits are lost during this kind of cast, whereas the bits are preserved through pointer casting, but a more detailed explanation would be much appreciated.
In current C standards (C99, C11), one should type-pun via a union rather than by dereferencing a type-cast pointer:
#include <stdint.h>

uint32_t float_bits(const float f)
{
    union {
        uint32_t u;
        float f;
    } temp;
    temp.f = f;     /* store as float ... */
    return temp.u;  /* ... read the same bytes back as uint32_t */
}
Most current architectures use IEEE-754 binary32 format for float type, but not all. If your program assumes this, it should say so in the documentation.
I personally like to check it at compile time, and fail the compilation if my code cannot cope.
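For example, here is a minimal C11 sketch of such a check. It assumes the compiler defines __STDC_IEC_559__ when its floating point conforms to IEEE 754 (Annex F):

#include <stdint.h>

#if !defined(__STDC_IEC_559__)
#error "This code assumes IEEE 754 (Annex F) floating point"
#endif

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "float must be the same size as uint32_t for this type-punning");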
Most current architectures use the same byte order (endianness) for integer and float types, but not all. Again, if your program assumes this, it should say so in the documentation, or check it at compile time (using a separate test program in the same project, for example).
In C, the expression (type)variable casts the value of variable to the type type. For example:
int32_t my_truncate(float value)
{
    return (int32_t)value;
}
If for example value == 2.125, then my_truncate(value) == 2.
Similarly, casting an integer value to a floating-point type evaluates to the floating-point value that best represents the original integer value. For example, (float)425 == 425.0f. (The final f indicates that the value has float type. Without the f at the end, a floating-point constant in C has type double.)
"Storage representation" refers to how values are actually stored in memory.
The difference between casting and type-punning is that casting converts the value itself to the new type, whereas type-punning reinterprets the storage representation of the value as that of the new type.
Thus, casting an integer value to a float type just yields the float that best represents the original integer value; but type-punning an integer value (of the same size as a float) to float means the storage representation of that integer is treated as the storage representation of a float, yielding the float value that has the same storage representation as the original integer.
Similarly in the other direction: type-punning a float to an unsigned integer type yields the unsigned integer (and thus the bits) that has the same storage representation as the original float. This is exactly what the float_bits() example function above does.
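The reverse direction, which is what the original question needs, is a straightforward mirror of float_bits(). A sketch using the same union approach (the helper name float_from_bits is mine):

#include <stdint.h>

float float_from_bits(const uint32_t u)
{
    union {
        uint32_t u;
        float f;
    } temp;
    temp.u = u;
    return temp.f;  /* float_from_bits(0x3f800000) yields 1.0f on IEEE 754 binary32 */
}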
In C99 and later versions of the C standard, as well as in POSIXy C implementations, you can use the frexpf() function to split a float into a normalized fraction x × 2^n, where either -1.0 < x <= -0.5 or 0.5 <= x < 1.0, and n is an integer exponent.
If need be, one can use this to construct, in a uint32_t, the representation of the closest value binary32 can represent. This can be useful on architectures that do not use IEEE 754 binary32 for float.
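For illustration, a small program of my own showing what frexpf() produces:

#include <math.h>
#include <stdio.h>

int main(void)
{
    int n;
    float x = frexpf(6.0f, &n);
    /* 6.0 == 0.75 * 2^3, so this prints: 6.0 = 0.750000 * 2^3 */
    printf("6.0 = %f * 2^%d\n", x, n);
    return 0;
}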
Related
I'm reading through the supplied code for a demo board (specifically the DC21561A) and found this snippet of code:
int32_t min_current_threshold_code;
min_current_threshold_code = (min_current_threshold / LTC2946_DELTA_SENSE_lsb) * resistor;
ack |= LTC2946_write_16_bits(LTC2946_I2C_ADDRESS, LTC2946_MIN_DELTA_SENSE_THRESHOLD_MSB_REG, (min_current_threshold_code << 4));
Here, everything on the RHS of the first assignment is a float. From what I can tell, and have tested, in an assignment with an int32_t LHS and a float RHS, the fractional bits of the float are discarded and only the integer part is left; e.g. 1.5 * 3.5 = 5.25 becomes 5.
The data above is written to a register over I2C. I assume the floats are used to give a more accurate estimate of the threshold values. However, I was wondering if this truncation when assigning a float to an int32_t is required by the C (or C++) standard, or if it is compiler-specific?
Edit: Some people have asked for more code. While my question is answered, here's the rest for thoroughness.
At the top of the file there is
float min_current_threshold = read_float();
const float LTC2946_DELTA_SENSE_lsb = 2.5006105E-05;
const float resistor = .02;
Yes, the standard mandates the truncation, for example in C99 chapter 6.3.1.4 paragraph 1:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero).
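A quick check of that rule (a minimal example of my own):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    float f = 1.5f * 3.5f;     /* 5.25f */
    int32_t i = (int32_t)f;    /* fractional part discarded: truncates toward zero */
    printf("%f -> %" PRId32 "\n", f, i);             /* prints 5.250000 -> 5 */
    printf("%f -> %" PRId32 "\n", -f, (int32_t)-f);  /* prints -5.250000 -> -5 */
    return 0;
}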
6.3.1.4 states:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined. 61)
And then there is a non-normative footnote explaining the above text:
61) The remaindering operation performed when a value of integer type is converted to unsigned type need not be performed when a value of real floating type is converted to unsigned type. Thus, the range of portable real floating values is (−1, Utype_MAX+1).
I am using gcc to test some simple casts from float to unsigned int.
The following piece of code gives the result 0.
const float maxFloat = 4294967295.0;
unsigned int a = (unsigned int) maxFloat;
printf("%u\n", a);
0 is printed (which I believe is very strange).
On the other hand the following piece of code:
const float maxFloat = 4294967295.0;
unsigned int a = (unsigned int) (signed int) maxFloat;
printf("%u\n", a);
prints 2147483648, which I believe is the correct result.
Why do I get two different results?
If you first do this:
printf("%f\n", maxFloat);
The output you'll get is this:
4294967296.000000
Assuming a float is implemented as an IEEE 754 single precision floating point type, the value 4294967295.0 cannot be represented exactly by this type because there aren't enough bits of precision. The closest value it can store is 4294967296.0.
Assuming an int (and likewise unsigned int) is 32 bits, the value 4294967296.0 is outside the range of both of these types. Converting a floating point type to an integer type when the value cannot be represented in the given integer type invokes undefined behavior.
This is detailed in section 6.3.1.4 of the C standard which dictates conversion from floating point types to integer types:
1 When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined. 61)
...
61) The remaindering operation performed when a value of integer type is converted to unsigned type need not be performed when a value of real floating type is converted to unsigned type. Thus, the range of portable real floating values is (−1, Utype_MAX+1).
The footnote in the above passage is referencing section 6.3.1.3, which details integer to integer conversions:
1 When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
2 Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
3 Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
The behavior you see in the first code snippet is consistent with an out-of-range conversion to an unsigned type when the value in question is an integer; however, because the value being converted has a floating point type, it is undefined behavior.
Just because one implementation does this doesn't mean that all will. In fact, gcc gives a different result if you change the optimization settings.
For example, on my machine using gcc 5.4.0, given this code:
float n = 4294967296;
printf("n=%f\n", n);
unsigned int a = (unsigned int) n;
int b = (signed int) n;
unsigned int c = (unsigned int) (signed int) n;
printf("a=%u\n", a);
printf("b=%d\n", b);
printf("c=%u\n", c);
I get the following results with -O0:
n=4294967296.000000
a=0
b=-2147483648
c=2147483648
And this with -O1:
n=4294967296.000000
a=4294967295
b=2147483647
c=2147483647
If on the other hand n is defined as long or long long, you would always get this output:
n=4294967296
a=0
b=0
c=0
The conversion to unsigned is well defined by the C standard as cited above, and the conversion to signed is implementation-defined, which gcc defines as follows:
The result of, or the signal raised by, converting an integer to a signed integer type when the value cannot be represented in an object of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).
For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.
Assuming IEEE 754 floating point numbers, the number 4294967295.0 can't be stored exactly in a float. It will be stored as 4294967296.0 instead (which is 2^32).
Further assuming your unsigned int has 32 value bits, this is one too large to fit in an unsigned int, so the result of the conversion is undefined according to the C standard; 0 is a "reasonable" outcome.
In your second case, you have undefined behavior as well, and I have no theory about what happens here at the representation level. The fact is, the number is much too large for a 32-bit signed int (still assuming that is what your machine uses).
From this remark in your question:
prints 2147483648 which I belive is the correct results.
I assume you wanted to see the representation of your float in memory. Casting will convert the value, so that's not the way to see the representation. The following code would do:
#include <stdio.h>

int main(void) {
    const float maxFloat = 4294967295.0f;
    const unsigned char *floatBytes = (const unsigned char *)&maxFloat;
    for (size_t i = 0; i < sizeof maxFloat; ++i)
    {
        printf("0x%02x ", floatBytes[i]);
    }
    puts("");
}
I'm wondering what happens when casting from a floating point type to an unsigned integer type in C when the value can't be accurately represented by the integer type in question. Take for instance
void func(void)
{
    float a = 1E10;
    unsigned b = a;
}
The value of b I get on my system (with unsigned on my system being able to represent values from 0 to 2^32 - 1) is 1410065408. This seems sensible to me because it's simply the lowest-order bits of the result of the cast.
I believe the behavior of operations such as these is undefined by the standard. Am I wrong? What can I expect in practice if I do things like this?
Also, what happens with signed types? If b is of type int, I get -2147483648, which doesn't really make sense to me.
What happens when casting floating point types to unsigned integer types when the value would overflow?
Undefined behavior (UB).
In addition to @user694733's fine answer: to prevent the undefined behavior caused by an out-of-range float-to-unsigned conversion, code can first test the float value.
Yet testing the range is tricky, for unsigned types and especially for signed types. The detail is that all conversions and constants used in the test must be exact, and FP math near the limits needs to be exact too.
Examples:
Conversion to a 32-bit unsigned is valid for the range -0.999... to 4294967295.999....
Conversion to a 32-bit 2's complement signed is valid for the range -2147483648.999... to 2147483647.999....
// Code uses FP constants that are exact powers of 2 to ensure their exact encoding.
#include <limits.h>
#include <math.h>
#include <stdbool.h>

// Form an FP constant that is exactly UINT_MAX + 1.
#define FLT_UINT_MAX_P1 ((UINT_MAX/2 + 1)*2.0f)

bool convert_float_to_unsigned(unsigned *u, float f) {
    if (f > -1.0f && f < FLT_UINT_MAX_P1) {
        *u = (unsigned) f;
        return true;
    }
    return false; // out of range
}

// Form an FP constant that is exactly INT_MAX + 1.
#define FLT_INT_MAX_P1 ((INT_MAX/2 + 1)*2.0f)

bool convert_float_to_int(int *i, float f) {
#if INT_MIN == -INT_MAX
    // Rare non-2's-complement integer
    if (fabsf(f) < FLT_INT_MAX_P1) {
        *i = (int) f;
        return true;
    }
#else
    // Do not use f + 1 > INT_MIN, as it may incur rounding.
    // Do not use f > INT_MIN - 1.0f, as it may incur rounding.
    // f - INT_MIN is expected to be exact for values near the limit.
    if (f - INT_MIN > -1 && f < FLT_INT_MAX_P1) {
        *i = (int) f;
        return true;
    }
#endif
    return false; // out of range
}
Pedantic code would take additional steps to cope with the rare FLT_RADIX 10.
FLT_EVAL_METHOD, which allows float math to be calculated at higher precision, may play a role, yet so far I do not see it negatively affecting the above solution.
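A quick usage sketch, assuming 32-bit unsigned and the functions above in scope (the test values are mine: 4294967040.0f is exactly representable and in range, while 4294967296.0f is exactly UINT_MAX + 1 and must be rejected):

#include <stdio.h>

int main(void)
{
    unsigned u;
    if (convert_float_to_unsigned(&u, 4294967040.0f))
        printf("converted: %u\n", u);   /* prints: converted: 4294967040 */
    if (!convert_float_to_unsigned(&u, 4294967296.0f))
        puts("out of range");           /* UINT_MAX + 1 is rejected */
    return 0;
}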
In both cases the value is out of range, so it's undefined behaviour.
6.3.1.4 Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined. 61)
61) The remaindering operation performed when a value of integer type is converted to unsigned type need not be performed when a value of real floating type is converted to unsigned type. Thus, the range of portable real floating values is (−1, Utype_MAX+1).
To make this well-defined code, you should check that the value is within the possible range before doing the conversion.
What does the standard say (and are there differences between the standards?) about assigning a float value that is out of an integer type's range to that integer?
So what should happen here, assuming a 16-bit short to keep the numbers small (USHRT_MAX == 65535)?
float f = 100000.0f;
short s = f;
s = (short) f;
unsigned short us = f;
us = (unsigned short) f;
This is undefined behaviour (with no diagnostic required). See C11 6.3.1.4 (earlier standards had similar text):
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
So, assuming your system has USHRT_MAX as 65535, short s = f; and all subsequent lines cause undefined behaviour.
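If you need a defined result, a range check in the spirit of the earlier answer works. A minimal sketch assuming a 16-bit short (the helper name and fallback parameter are mine):

#include <limits.h>

/* Returns fallback when f is out of short's range.
   SHRT_MAX + 1 == 32768 and SHRT_MIN - 1 == -32769 are both exactly
   representable as float, so these bounds are exact. */
short safe_float_to_short(float f, short fallback)
{
    if (f > (float)SHRT_MIN - 1.0f && f < (float)SHRT_MAX + 1.0f)
        return (short)f;
    return fallback;
}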
What are the implicit types for numbers in C? If, for example, I have a decimal number in a calculation, is the decimal always treated as a double? If I have a non-decimal number, is it always treated as an int? What if my non-decimal number is larger than an int value?
I'm curious because this affects type conversion and promotion. For instance, if I have the following calculation:
float a = 1.0 / 25;
Is 1.0 treated as a double and 25 treated as an int? Is 25 then promoted to a double, the calculation performed at double precision and then the result converted to a float?
What about:
double b = 1 + 2147483649; // note that the number is larger than an int value
If the number has neither a decimal point nor an exponent, it is an integer of some sort; by default, an int.
If the number has a decimal point or an exponent, it is a floating point number of some sort; by default, a double.
That's about it. You can append suffixes to numbers (such as ULL for unsigned long long) to specify the type more precisely. Otherwise (simplifying a little), an integer constant gets the smallest integer type (int or longer) that will hold its value.
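A few illustrative declarations (my own examples):

int i = 25;                  /* decimal, no suffix: int */
double d = 1.0;              /* decimal point, no suffix: double */
float fl = 1.0f;             /* f suffix: float */
unsigned long ul = 25UL;     /* UL suffix: unsigned long */
long long big = 2147483649;  /* too big for a 32-bit int: the constant
                                gets type long or long long */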
In your examples, the code is:
float a = 1.0 / 25;
double b = 1 + 2147483649;
The value of a is calculated by noting that 1.0 is a double and 25 is an integer. When processing the division, the int is converted to a double, the calculation is performed (producing a double), and the result is then coerced into a float for assignment to a. All of this can be done by the compiler, so the result will be pre-computed.
Similarly, on a system with 32-bit int, the value 2147483649 is too big to be an int, so it will be treated as a signed type bigger than int (either long or long long); the 1 is added (yielding the same type), and then that value is converted to a double. Again, it is all done at compile time.
These computations are governed by the same rules as other computations in C.
The type rules for integer constants are detailed in §6.4.4.1 Integer constants of ISO/IEC 9899:1999. There's a table which details the types depending on the suffix (if any) and the kind of constant (decimal vs octal or hexadecimal). For decimal constants, the value is always a signed integer; for octal or hexadecimal constants, the type can be signed or unsigned, taking the first type in which the value fits. Thanks to Daniel Fischer for pointing out my mistake.
http://en.wikipedia.org/wiki/Type_conversion
The standard has a general guideline for what you can expect, but compilers have a superset of rules that encompass the standard as well as rules for optimizing. The above link discusses some of the generalities you can expect. If you are concerned about implicit coercion, it is typically good practice to use explicit casting.
Keep in mind that the size of the primitive types is not guaranteed.
1.0 / 25
Evaluates to a double because one of the operands is a double. If you changed it to 1/25, the evaluation would be performed on two integers and would evaluate to 0.
double b = 1 + 2147483649;
The right side is evaluated as an integer and then coerced to a double during assignment.
Actually, in your example you may get a compiler warning. You'd either write 1.0f to make it a float to start with, or explicitly cast the result before assigning it.
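A small program of my own making the evaluation types visible:

#include <stdio.h>

int main(void)
{
    float a = 1.0 / 25;    /* double division, result narrowed to float */
    float b = 1.0f / 25;   /* float division throughout, no narrowing */
    int   c = 1 / 25;      /* integer division: 0 */
    printf("a=%f b=%f c=%d\n", a, b, c);
    return 0;
}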