NaN conversion from float to double changes its underlying data - c

int main() {
    unsigned int a = 2139156962;
    float b = *(float*)&a;       // nan(0xf1e2)
    double c = (double)b;        // nan canonicalization?
    float d = (float)c;          // nan(0x40f1e2), d != b
    unsigned int e = *(int*)&d;  // e != a
    return 0;
}
NaN values can be represented in many different ways. As the example above shows, converting a NaN value of nan(0xf1e2) to a double does not preserve the input bit pattern, and casting it back to a float does not return the same bit pattern as the original input.
From this link I can see that on x64, CVTSS2SD seems to canonicalize QNaN inputs:
the sign bit is preserved, the 8-bit exponent FFH is replaced by the 11-bit exponent 7FFH, and the 24-bit significand is extended to a 53-bit significand by appending 29 bits equal to 0.
So regardless of what the QNaN bit pattern of our input is, the output will use an exponent of 0x7FF and won't preserve all of the original bits? Is this some sort of NaN canonicalization?
If so, then this answer may not be fully accurate?
floats can be promoted to double and the value is unchanged.
Our output is still a NaN but the underlying data is now changed and c != b.
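For what it's worth, the same experiment can be written without the pointer casts (which are not well-defined C) by copying the bits with memcpy; the hex values in the comments are what an x86-64 build typically shows, since what the conversion does to a signaling NaN's bits is implementation-specific:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int a = 0x7F80F1E2;   /* same bit pattern as 2139156962: a NaN with payload 0xF1E2 */
    float b;
    memcpy(&b, &a, sizeof b);      /* well-defined reinterpretation */

    double c = (double)b;          /* the float->double conversion may quiet the NaN */
    float d = (float)c;

    unsigned int e;
    memcpy(&e, &d, sizeof e);
    printf("a = %08X\ne = %08X\n", a, e);   /* typically 7F80F1E2 and 7FC0F1E2 on x86-64 */
    return 0;
}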

Related

what does float cf = *(float *)&ci; in C do?

I'm trying to find out exactly what this program prints.
#include <stdio.h>

int main() {
    float bf = -62.140625;
    int bi = *(int *)&bf;
    int ci = bi + (1 << 23);
    float cf = *(float *)&ci;
    printf("%X\n", bi);
    printf("%f\n", cf);
}
This prints out:
C2789000
-124.281250
But what happens line by line? I do not understand.
Thanks in advance.
It is a convoluted way of doubling a 32-bit floating-point number by adding one to its exponent. Moreover, it is incorrect because it violates the strict aliasing rule by accessing an object of type float through a pointer to int.
The exponent is located in bits 23 to 30. Adding 1<<23 increments the exponent by one, which works like multiplying the original number by 2.
If we rewrite this program to remove the pointer type punning:

#include <stdio.h>
#include <string.h>

int main() {
    float bf = -62.140625;
    int bi;
    memcpy(&bi, &bf, sizeof(bi));
    for (int i = 0; i < 32; i += 8)
        printf("%02x ", ((unsigned)bi >> i) & 0xffu);
    printf("\n");
    bi += (1 << 23);
    memcpy(&bf, &bi, sizeof(bf));
    printf("%f\n", bf);
}
Float numbers have the format: value = (-1)^sign * 1.fraction * 2^(exponent - 127)
-62.140625 is -1.94189453125 * 2^5, so its biased exponent field holds 127 + 5 = 132.
bi += (1<<23);
increments the exponent field by one, so the resulting float number will be -62.140625 * 2^1, which is equal to -124.281250. If you change that line to
bi += (1<<24);
it will increment the exponent field by two, so the resulting float number will be -62.140625 * 2^2, which is equal to -248.562500.
float bf = -62.140625;
This creates a float object named bf and initializes it to −62.140625.
int bi = *(int *)&bf;
&bf takes the address of bf, which produces a pointer to a float. (int *) says to convert this to a pointer to an int. Then * says to access the pointed-to memory, as if it were an int.
The C standard does not define the behavior of this access, but many C implementations support it, sometimes requiring a command-line switch to enable support for it.
A float value is normally encoded in some way. −62.140625 is not an integer, so it cannot be stored as a binary numeral that represents an integer. It is encoded. Reinterpreting the bytes of memory as an int using * (int *) &bf is an attempt to get the bits into an int so they can be manipulated directly, instead of through floating-point operations.
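As a side note, a sketch of the well-defined way to do the same reinterpretation (assuming sizeof(int) == sizeof(float), which holds on common platforms) uses memcpy:

#include <string.h>

/* Copy the bytes of a float into an int without a type-punned pointer.
   Assumes sizeof(int) == sizeof(float). */
static int float_bits(float f)
{
    int i;
    memcpy(&i, &f, sizeof i);
    return i;
}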
int ci = bi+(1<<23);
The format most commonly used for the float type is IEEE-754 binary32, also called “single precision.” In this format, bit 31 is a sign bit, bits 30-23 encode an exponent and/or some other information, and bits 22-0 encode most of a significand (or, in the case of a NaN, other information). (The significand is the fraction part of a floating-point representation. A floating-point format represents a number as ±F•b^e, where b is a fixed base, F is a number with a fixed precision in a certain range, and e is an exponent in a certain range. F is the significand.)
1<<23 is 1 shifted 23 bits, so it is 1 in the exponent field, bits 30-23.
If the exponent field contains 1 to 253, then adding 1 to it increases the encoded exponent by 1. (The codes 0 and 255 have special meanings in the exponent field. 254 is a normal value, but adding 1 to it overflows the exponent into the special code 255, so it will not increase the exponent in a normal way.)
Since the base b of a binary floating-point format is 2, increasing the exponent by 1 multiplies the represented number by 2. ±F•b^e becomes ±F•b^(e+1).
float cf = *(float *)&ci;
This is the opposite of the previous reinterpretation: It says to reinterpret the bytes of the int as a float.
printf("%X\n",bi);
This says to print bi using a hexadecimal format. This is technically wrong; the %X format should be used with an unsigned int, not an int, but most C implementations let it pass.
printf("%f\n",cf);
This prints the new float value.

Convert integer to a new floating point format

This code is intended to convert a signed 16-bit integer to a new floating point format (similar to the normal IEEE 754 floating point format). I understand the regular IEEE 754 floating point format, but I don't understand how this code works or what this floating point format looks like. I would be grateful for some insight into the idea of the code, and into how many bits are used for the significand and how many bits are used for the exponent in this new format.
#include <stdint.h>

uint32_t short2fp (int16_t inp)
{
    int x, f, i;

    if (inp == 0)
    {
        return 0;
    }
    else if (inp < 0)
    {
        i = -inp;
        x = 191;
    }
    else
    {
        i = inp;
        x = 63;
    }
    for (f = i; f > 1; f >>= 1)
        x++;
    for (f = i; f < 0x8000; f <<= 1);
    return (x * 0x8000 + f - 0x8000);
}
A couple of tricks should help you recognize the parameters (exponent's size and mantissa's size) of a custom floating-point format:
First of all, how many bits is this float number long?
We know that the sign bit is the highest bit set in any negative float number. If we calculate short2fp(-1) we obtain 0b10111111000000000000000, that is a 23-bit number. Therefore, this custom float format is a 23-bit float.
If we want to know the exponent's and mantissa's sizes, we can convert the number 3, because this will set both the highest exponent's bit and the highest mantissa's bit. If we do short2fp(3), we obtain 0b01000000100000000000000, and if we split this number we get 0 1000000 100000000000000: the first bit is the sign, then we have 7 bits of exponent, and finally 15 bits of mantissa.
Conclusion:
Float format size: 23 bits
Exponent size: 7 bits
Mantissa size: 15 bits
NOTE: this conclusion may be wrong for a number of different reasons (e.g.: a float format particularly different from the IEEE754 ones, the short2fp() function not working properly, too much coffee this morning, etc.), but in general this works for every binary floating-point format defined by IEEE754 (binary16, binary32, binary64, etc.), so I'm confident it works for your custom float format too.
P.S.: the short2fp() function is written rather opaquely; you may want to improve its clarity if you want to investigate its inner workings.
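For example, a small test harness along these lines (assuming the short2fp() definition from the question is available) prints the raw bit patterns that the analysis above relies on:

#include <stdio.h>
#include <stdint.h>

uint32_t short2fp(int16_t inp);   /* definition from the question */

int main(void)
{
    /* Probe values chosen to expose the field boundaries. */
    printf("short2fp(-1) = %06x\n", (unsigned)short2fp(-1));  /* 5f8000: sign + exponent bits */
    printf("short2fp( 1) = %06x\n", (unsigned)short2fp(1));   /* 1f8000: the exponent bias    */
    printf("short2fp( 3) = %06x\n", (unsigned)short2fp(3));   /* 204000: top mantissa bit set */
    return 0;
}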
The two statements x = 191; and x = 63; set x to either 1•128 + 63 or 0•128 + 63, according to whether the number is negative or positive. Therefore 128 (2^7) carries the sign bit at this point. As x is later multiplied by 0x8000 (2^15), the sign bit is 2^22 in the result.
These statements also initialize the exponent to 0, which is encoded as 63 due to a bias of 63. This follows the IEEE-754 pattern of using a bias of 2^(n−1)−1 for an exponent field of n bits. (The “single” format has eight exponent bits and a bias of 2^7−1 = 127, and the “double” format has 11 exponent bits and a bias of 2^10−1 = 1023.) Thus we expect an exponent field of 7 bits, with bias 2^6−1 = 63.
This loop:
for (f = i; f > 1; f >>= 1)
x++;
detects the magnitude of i (the absolute value of the input), adding one to the exponent for each power of two that f is detected to exceed. For example, if the input is 4, 5, 6, or 7, the loop executes two times, adding two to x and reducing f to 1, at which point the loop stops. This confirms the exponent bias; if i is 1, x is left as is, so the initial value of 63 corresponds to an exponent of 0 and a represented value of 20 = 1.
The loop for (f = i; f < 0x8000; f <<= 1); scales f in the opposite direction, moving its leading bit to be in the 0x8000 position.
In return (x * 0x8000 + f - 0x8000);, x * 0x8000 moves the sign bit and exponent field from their initial positions (bit 7 and bits 6 to 0) to their final positions (bit 22 and bits 21 to 15). f - 0x8000 removes the leading bit from f, giving the trailing bits of the significand. This is then added to the final value, forming the primary encoding of the significand in bits 14 to 0.
Thus the format has the sign bit in bit 22, exponent bits in bits 21 to 15 with a bias of 63, and the trailing significand bits in bits 14 to 0.
The format could encode subnormal numbers, infinities, and NaNs in the usual way, but this is not discernible from the code shown, as it encodes only integers in the normal range.
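Under that reading of the format, a decoder sketch (the name fp2double is made up here for illustration) recovers the encoded value like this:

#include <math.h>
#include <stdint.h>

/* Decode the 23-bit format described above: sign in bit 22, 7-bit exponent
   (bias 63) in bits 21..15, trailing significand bits in bits 14..0 with an
   implicit leading 1.  Subnormals, infinities, and NaNs are ignored, since
   short2fp() never produces them. */
static double fp2double(uint32_t enc)
{
    int sign = (enc >> 22) & 1;
    int exp  = (int)((enc >> 15) & 0x7F) - 63;
    double significand = 1.0 + (enc & 0x7FFF) / 32768.0;   /* 2^15 = 32768 */
    double value = ldexp(significand, exp);                 /* significand * 2^exp */
    return sign ? -value : value;
}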
As a comment suggested, I would use a small number of strategically selected test cases to reverse engineer the format. The following assumes an IEEE-754-like binary floating-point format using sign-magnitude encoding with a sign bit, exponent bits, and significand (mantissa) bits.
short2fp (1) = 001f8000 while short2fp (-1) = 005f8000. The exclusive OR of these is 0x00400000 which means the sign bit is in bit 22 and this floating-point format comprises 23 bits.
short2fp (1) = 001f8000, short2fp (2) = 00200000, and short2fp (4) = 00208000. The difference between consecutive values is 0x00008000 so the least significant bit of the exponent field is bit 15, the exponent field comprises 7 bits, and the exponent is biased by (0x001f8000 >> 15) = 0x3F = 63.
This leaves the least significant 15 bits for the significand. We can see from short2fp (2) = 00200000 that the integer bit of the significand (mantissa) is not stored, that is, it is implicit as in IEEE-754 formats like binary32 or binary64.
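The same probes are easy to automate; a couple of lines is enough to print the masks used in this analysis (again assuming short2fp() from the question is in scope):

#include <stdio.h>
#include <stdint.h>

uint32_t short2fp(int16_t inp);   /* from the question */

int main(void)
{
    printf("sign mask    : %08x\n", (unsigned)(short2fp(1) ^ short2fp(-1)));  /* 00400000: only the sign differs */
    printf("exponent step: %08x\n", (unsigned)(short2fp(2) - short2fp(1)));   /* 00008000: doubling adds 1 to the exponent field */
    return 0;
}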

How is float to int type conversion done in C? [duplicate]

I was wondering if you could help explain the process on converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding on the casting from type to type will help me more in this stage.
From what I know so far, for int to float, you will have to convert the integer into binary, normalize the value of the integer by finding the significand, exponent, and fraction, and then output the value in float from there?
As for float to int, you will have to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get an int value?
I tried to follow the instructions from this question: Casting float to int (bitwise) in C.
But I was not really able to understand it.
Also, could someone explain why rounding will be necessary for values greater than 23 bits when converting int to float?
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. (The Wikipedia article linked below includes a diagram of this layout.)
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e., 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using memcpy, copies the resulting bit pattern into a float. A type-punned pointer (or reinterpret_cast in C++) is often used here instead, but that is an ugly hack and not well-defined C. You could also do this with a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
#include <string.h>

float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0;  // or abort(); or whatever you'd like here.

    int shifts = 0;

    // Align the leading 1 of the significand to the hidden-1
    // position.  Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times.  The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts).  IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    // Now merge significand and exponent.  Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

    // Copy the bit pattern into a float and return it.
    float result;
    memcpy(&result, &merged, sizeof result);
    return result;
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove more than 24 bits into 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
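A two-line check makes the loss visible; the exact result depends on the rounding mode, and the comment shows the usual round-to-nearest-even outcome:

#include <stdio.h>

int main(void)
{
    unsigned int big = (1u << 24) + 1;   /* 16777217: 25 significant bits, one too many for a float */
    float f = (float)big;                /* must round; the LSB is lost */
    printf("%u -> %.1f\n", big, f);      /* typically prints: 16777217 -> 16777216.0 */
    return 0;
}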
Have you checked the IEEE 754 floating-point representation?
In 32-bit normalized form, it has a sign bit, an 8-bit exponent (excess-127), and a 23-bit mantissa stored as a binary fraction with the leading "1." dropped (it is always there in normalized form); the radix is 2, not 10. That is: the first stored mantissa bit has the value 1/2, the next bit 1/4, and so on.
Joe Z's answer is elegant, but the range of input values is highly limited. A 32-bit float can store all integer values from the following range:
[-2^24 ... +2^24] = [-16777216 ... +16777216]
and some other values outside this range.
The whole range would be covered by this:
#include <limits.h>

float int2float(int value)
{
    // handles all values from [-2^24...2^24]
    // outside this range only some integers may be represented exactly
    // this method will use truncation 'rounding mode' during conversion

    // all-zero bits can safely be reinterpreted as 0.0
    if (value == 0) return 0.0;

    if (value == INT_MIN) // ie -2^31
    {
        // -(-2^31) = -2^31 so we'll not be able to handle it below - use const
        // value = 0xCF000000;
        return (float)INT_MIN; // *((float*)&value); is undefined behaviour
    }

    unsigned int sign = 0;

    // handle negative values
    if (value < 0)
    {
        sign = 1U << 31;
        value = -value;
    }

    // although right shift of a negative signed value is implementation-defined,
    // all compilers (that I know) do an arithmetic shift (copy the sign into the MSB);
    // using unsigned abs_value_copy for the shifts avoids the question entirely
    unsigned int abs_value_copy = value;

    // find leading one (include bit 0 so that inputs like 1 are normalized too)
    int bit_num = 31;
    int shift_count = 0;
    for (; bit_num >= 0; bit_num--)
    {
        if (abs_value_copy & (1U << bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right
                shift_count = bit_num - 23;
                abs_value_copy >>= shift_count;
            }
            else
            {
                // need to shift left
                shift_count = 23 - bit_num;
                abs_value_copy <<= shift_count;
            }
            break;
        }
    }

    // exponent is biased by 127
    unsigned int exp = bit_num + 127;

    // clear leading 1 (bit #23) (it will implicitly be there but not stored)
    unsigned int coeff = abs_value_copy & ~(1U << 23);

    // move exp to the right place
    exp <<= 23;

    union
    {
        unsigned int rint;
        float rfloat;
    } ret = { sign | exp | coeff };

    return ret.rfloat;
}
Of course there are other ways to find the absolute value of an int (branchless ones, for example). Similarly, counting leading zeros can also be done without a branch, so treat this example as just that: an example ;-).
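A quick way to sanity-check int2float() is to compare it against the compiler's own int-to-float conversion for a handful of inputs, e.g.:

#include <stdio.h>

float int2float(int value);   /* the function above */

int main(void)
{
    int tests[] = { 0, 1, -1, 3, 255, -62, 16777216, -16777216 };
    for (int i = 0; i < (int)(sizeof tests / sizeof tests[0]); i++)
    {
        float a = int2float(tests[i]);
        float b = (float)tests[i];     /* reference conversion */
        printf("%11d : %14.1f %s\n", tests[i], a, a == b ? "ok" : "MISMATCH");
    }
    return 0;
}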

Error on casting unsigned int to float

For the following program:
#include <stdio.h>

int main()
{
    unsigned int a = 10;
    unsigned int b = 20;
    unsigned int c = 30;
    float d = -((a*b)*(c/3));
    printf("d = %f\n", d);
    return 0;
}
It is very strange that output is
d = 4294965248.000000
When I change the magic number 3 in the expression that calculates d to 3.0, I get the correct result:
d = 2000.000000
If I change the type of a, b, and c to int, I also get the correct result.
I guess this error is caused by the conversion from unsigned int to float, but I do not know the details of how the strange result was produced.
The problem is that you apply the unary minus to an unsigned int before the assignment to float. If you run the code below, you will most likely get 4294965296:
#include <stdio.h>

int main()
{
    unsigned int a = 10;
    unsigned int b = 20;
    unsigned int c = 30;
    printf("%u", -((a*b)*(c/3)));
    return 0;
}
The -2000 to the right of your equals sign is set up as a signed integer (probably 32 bits in size) and will have the hexadecimal value 0xFFFFF830. The compiler generates code to move this signed integer into your unsigned integer x, which is also a 32-bit entity. The compiler assumes you only have a positive value to the right of the equals sign, so it simply moves all 32 bits into x. x now has the value 0xFFFFF830, which is 4294965296 if interpreted as a positive number. But the printf format of %d says the 32 bits are to be interpreted as a signed integer, so you get -2000. If you had used %u it would have printed as 4294965296.
#include <stdio.h>
#include <limits.h>

int main()
{
    float d = 4294965296;
    printf("d = %f\n\n", d);
    return 0;
}
When you convert 4294965296 to float, the number is too long to fit into the fraction (significand) part, so some precision is lost. Because of that loss, you get 4294965248.000000, just as I did.
The IEEE-754 floating-point standard is a standard for representing
and manipulating floating-point quantities that is followed by all
modern computer systems.
bit:  31 | 30 ... 23 | 22 ... 0
       S | EEEEEEEE  | MMMMMMMMMMMMMMMMMMMMMMM
The bit numbers are counted from the least-significant bit. The topmost bit is the sign (0 for positive, 1 for negative). The following 8 bits are the exponent in excess-127 binary notation; this means that the binary pattern 01111111 = 127 represents an exponent of 0, 10000000 = 128 represents 1, 01111110 = 126 represents -1, and so forth. The 24-bit mantissa fits in the remaining 23 bits, with its leading 1 stripped off as described above. Source
As you can see, when 4294965296 is converted to float, only the leading 24 significant bits fit in the significand; the low bits 00110000 are rounded away:
111111111111111111111000 00110000 <-- 4294965296
111111111111111111111000 00000000 <-- 4294965248
This is because you use - on an unsigned int. Unary minus does not make an unsigned value negative; it wraps around modulo 2^32 (giving 2^32 - 2000 here). Let's print some unsigned integers:
printf("Positive: %u\n", 2000);
printf("Negative: %u\n", -2000);
// Output:
// Positive: 2000
// Negative: 4294965296
Lets print the hex values:
printf("Positive: %x\n", 2000);
printf("Negative: %x\n", -2000);
// Output
// Positive: 7d0
// Negative: fffff830
As you can see, the value has wrapped around to 2^32 - 2000 (0xfffff830). So the problem comes from using - on an unsigned int, not from casting unsigned int to float.
As others have said, the issue is that you are trying to negate an unsigned number. Most of the solutions already given have you do some form of casting to float such that the arithmetic is done on floating point types. An alternate solution would be to cast the results of your arithmetic to int and then negate, that way the arithmetic operations will be done on integral types, which may or may not be preferable, depending on your actual use-case:
#include <stdio.h>

int main(void)
{
    unsigned int a = 10;
    unsigned int b = 20;
    unsigned int c = 30;
    float d = -(int)((a*b)*(c/3));
    printf("d = %f\n", d);
    return 0;
}
Your whole calculation will be done unsigned so it is the same as
float d = -(2000u);
-2000 as an unsigned int (assuming a 32-bit int) is 4294965296;
this gets written into your float d. But as a float cannot store this exact number, it gets stored as 4294965248.
As a rule of thumb, you can say that a float has a precision of about 7 significant base-10 digits.
What is calculated is 2^32 - 2000 and then floating point precision does the rest.
If you instead use 3.0 this changes the types in your calculation as follows
float d = -((a*b)*(c/3.0));
float d = -((unsigned*unsigned)*(unsigned/double));
float d = -((unsigned)*(double));
float d = -(double);
leaving you with the correct negative value.
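Seen side by side (a minimal sketch of the two variants):

#include <stdio.h>

int main(void)
{
    unsigned int a = 10, b = 20, c = 30;
    float d_unsigned = -((a * b) * (c / 3));    /* all unsigned arithmetic, wraps modulo 2^32 */
    float d_double   = -((a * b) * (c / 3.0));  /* promoted to double, keeps the sign */
    printf("%f\n%f\n", d_unsigned, d_double);   /* 4294965248.000000 and -2000.000000 */
    return 0;
}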
you need to cast the ints to floats
float d = -((a*b)*(c/3));
to
float d = -(((float)a*(float)b)*((float)c/3.0));
-((a*b)*(c/3)); is all performed in unsigned integer arithmetic, including the unary negation. Unary negation is well-defined for an unsigned type: mathematically the result is reduced modulo 2^N, where N is the number of bits in unsigned int. When you assign that large number to the float, you encounter some loss of precision; because of the number's magnitude, the result is the nearest float to the unsigned int value, which in this range is a multiple of 256.
If you change 3 to 3.0, then c / 3.0 has type double, and the result of a * b is therefore converted to double before being multiplied. This double is then assigned to a float; since -2000 is exactly representable in float, no precision is lost this time.
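The precision claim is easy to confirm: near 2^32 the representable float values are 256 apart, so 4294965296 can only land on a neighbouring multiple of 256 (a small sketch assuming 32-bit unsigned int and IEEE-754 float; nextafterf needs C99 and possibly -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    float f = (float)4294965296u;               /* 2^32 - 2000 */
    printf("%.1f\n", f);                         /* 4294965248.0 */
    printf("%.1f\n", nextafterf(f, 5e9f) - f);   /* spacing between adjacent floats here: 256.0 */
    return 0;
}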

why the result is nan in this c float conversion?

The float and int types are both 4 bytes, and I try converting in this way:
unsigned int x = 0; // 00000000
x = ~x>>1; // 7fffffff
float f = *((float *)&x);
printf("%f\n", f);
Because the first bit in a C float represents +/-, the next 8 bits are the exponent exp in 2^(exp-127), and the rest form the fraction 0.xxxxx..., I thought I could get the max float number 0|11111111|111...111, but instead I get a nan.
So is there anything wrong?
You are close, but your exponent is out of range so you have a NaN. FLT_MAX is:
0 11111110 11111111111111111111111
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
Note that the max exponent is 11111110, as 11111111 is reserved for infinities and NaNs.
The corresponding hex value is:
0x7f7fffff
So your code should be:
unsigned int x = 0x7f7fffff;
float f = *((float *)&x);
printf("%f\n", f);
and the result will be:
3.4028235E38
If you're interested in IEEE-754 format then check out this handy online calculator which converts between binary, hex and float formats: http://www.h-schmidt.net/FloatConverter/IEEE754.html
A bit-wise IEEE floating-point standard single precision (32-bit) NaN(Not a Number) would be:
s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx
where s is the sign (most often ignored in applications) and the fraction bits x are non-zero (a fraction of all zeros would encode an infinity instead).
In order to get the binary representation of the max float value, execute the "inverse":
float f = FLT_MAX;
int x = *((int*)&f);
printf("0x%.8X\n",x);
The result is 0x7F7FFFFF (and not 0x7FFFFFFF as you have assumed).
The C-language standard does not dictate sizeof(float) == sizeof(int).
So you will have to verify this on your platform in order to ensure correct execution.
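One way to make that size check explicit, and to avoid the pointer cast entirely, is a sketch like this (assuming C11 for _Static_assert):

#include <stdio.h>
#include <string.h>
#include <float.h>

_Static_assert(sizeof(float) == sizeof(unsigned int),
               "float and unsigned int must have the same size here");

int main(void)
{
    float f = FLT_MAX;
    unsigned int x;
    memcpy(&x, &f, sizeof x);   /* well-defined, unlike *(int *)&f */
    printf("0x%.8X\n", x);      /* 0x7F7FFFFF on IEEE-754 systems */
    return 0;
}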
