Why is the result NaN in this C float conversion?

float and unsigned int are both 4 bytes on my platform, and I try converting this way:
unsigned int x = 0; // 00000000
x = ~x>>1; // 7fffffff
float f = *((float *)&x);
printf("%f\n", f);
Because the first bit of a C float is the sign, the next 8 bits are the exponent exp in 2^(exp-127), and the remaining bits encode the fraction 0.xxxxx..., I expected 0|11111111|111...111 to give me the maximum float value. But instead I get a NaN.
So is there anything wrong?

You are close, but your exponent is out of range so you have a NaN. FLT_MAX is:
0 11111110 11111111111111111111111
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
Note that the max exponent is 11111110, as the all-ones pattern 11111111 is reserved for infinities and NaNs.
The corresponding hex value is:
0x7f7fffff
So your code should be:
unsigned int x = 0x7f7fffff;
float f = *((float *)&x);
printf("%f\n", f);
and the result will be:
3.4028235E38
If you're interested in IEEE-754 format then check out this handy online calculator which converts between binary, hex and float formats: http://www.h-schmidt.net/FloatConverter/IEEE754.html
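If you'd rather avoid the pointer cast (which formally violates strict aliasing), here is a minimal sketch of the same experiment using memcpy, assuming float and unsigned int are both 32 bits:
#include <float.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int x = 0x7f7fffff;    /* the FLT_MAX bit pattern from above */
    float f;
    memcpy(&f, &x, sizeof f);       /* well-defined way to reinterpret the bits */
    printf("%g\n", f);              /* 3.40282e+38 */
    printf("f == FLT_MAX: %d\n", f == FLT_MAX);  /* prints 1 */
    return 0;
}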

Bit-wise, an IEEE-754 single-precision (32-bit) NaN (Not a Number) is:
s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx
where s is the sign (most often ignored in applications) and the x bits are not all zero (an all-zero fraction encodes an infinity instead).
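As a quick illustration, this sketch (assuming IEEE-754 binary32 and 32-bit unsigned int) builds the common quiet-NaN pattern 0x7FC00000, which matches the template above, and checks it with isnan:
#include <math.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int bits = 0x7FC00000;   /* s=0, exponent all ones, non-zero fraction */
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("isnan: %d\n", isnan(f));  /* prints 1 */
    printf("value: %f\n", f);         /* typically prints nan */
    return 0;
}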

In order to get the binary representation of the max float value, execute the "inverse":
float f = FLT_MAX;
int x = *((int*)&f);
printf("0x%.8X\n",x);
The result is 0x7F7FFFFF (and not 0x7FFFFFFF as you have assumed).
The C-language standard does not dictate sizeof(float) == sizeof(int).
So you will have to verify this on your platform in order to ensure correct execution.
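One way to do that verification at compile time, sketched here assuming a C11 compiler where _Static_assert is available:
#include <float.h>
#include <stdio.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(int), "float and int differ in size");

int main(void) {
    float f = FLT_MAX;
    int x;
    memcpy(&x, &f, sizeof x);          /* also sidesteps the aliasing issue */
    printf("0x%.8X\n", (unsigned)x);   /* 0x7F7FFFFF */
    return 0;
}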

Related

NaN conversion from float to double changes its underlying data

int main(){
    unsigned int a = 2139156962;
    float b = *(float*)&a;                  // nan(0xf1e2)
    double c = (double)b;                   // NaN canonicalization?
    float d = (float)c;                     // nan(0x40f1e2), d != b
    unsigned int e = *(unsigned int*)&d;    // e != a
    return 0;
}
NaN values can be represented in many different ways. As the example above shows, converting a NaN value of nan(0xf1e2) to a double does not preserve the input bit pattern, and casting it back to a float does not return the same value as the original input.
From this link I can see that on x64, CVTSS2SD seems to canonicalize QNaN inputs:
the sign bit is preserved, the 8-bit exponent FFH is replaced by the 11-bit exponent 7FFH, and the 24-bit significand is extended to a 53-bit significand by appending 29 bits equal to 0.
So regardless of what the QNaN bit pattern of our input is, the output will use 0x7FF and won't preserve all the original bits? Is this some sort of NaN canonicalization?
If so, is this answer, which claims that "floats can be promoted to double and the value is unchanged", not fully accurate?
Our output is still a NaN, but the underlying bits have changed and d != b.
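For reference, here is a sketch of the same experiment with the bit patterns read out via memcpy instead of pointer casts (assuming 32-bit float/unsigned int and 64-bit double/unsigned long long):
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int a = 2139156962;        /* 0x7F80F1E2: all-ones exponent, quiet bit clear */
    float b;
    memcpy(&b, &a, sizeof b);
    double c = (double)b;               /* the conversion may quiet a signaling NaN */
    float d = (float)c;
    unsigned long long cbits;
    unsigned int e;
    memcpy(&cbits, &c, sizeof cbits);
    memcpy(&e, &d, sizeof e);
    printf("a = 0x%08X\n", a);
    printf("c = 0x%016llX\n", cbits);
    printf("e = 0x%08X\n", e);          /* 0x7FC0F1E2 per the observation above */
    return 0;
}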

What does float cf = *(float *)&ci; do in C?

I'm trying to find out exactly what this program prints.
#include <stdio.h>

int main() {
    float bf = -62.140625;
    int bi = *(int *)&bf;
    int ci = bi + (1 << 23);
    float cf = *(float *)&ci;
    printf("%X\n", bi);
    printf("%f\n", cf);
}
This prints out:
C2789000
-124.281250
But what happens line by line? I do not understand.
Thanks in advance.
It is a convoluted way of doubling a 32-bit floating-point number by adding one to its exponent. Moreover, it is incorrect due to violating the strict aliasing rule by accessing an object of type float via type int.
The exponent is located in bits 23 to 30. Adding 1<<23 increments the exponent by one, which works like multiplying the original number by 2.
If we rewrite this program to remove the pointer punning:
#include <stdio.h>
#include <string.h>

int main() {
    float bf = -62.140625;
    int bi;
    memcpy(&bi, &bf, sizeof(bi));
    for (int i = 0; i < 32; i += 8)
        printf("%02x ", ((unsigned)bi >> i) & 0xffu);  /* bytes, least significant first */
    printf("\n");
    bi += (1 << 23);
    memcpy(&bf, &bi, sizeof(bi));
    printf("%f\n", bf);
}
Float numbers have the format: 1 sign bit, 8 exponent bits (biased by 127), and 23 fraction bits.
-62.140625 is 1.94189453125 * 2^5, so its stored exponent field is 5 + 127 = 132.
bi += (1<<23);
increases the exponent field by 1, so the resulting float number is -62.140625 * 2^1, which equals -124.281250. If you change that line to
bi += (1<<24);
it increases the exponent field by 2, so the resulting float number is -62.140625 * 2^2, which equals -248.562500.
float bf = -62.140625;
This creates a float object named bf and initializes it to −62.140625.
int bi = *(int *)&bf;
&bf takes the address of bf, which produces a pointer to a float. (int *) says to convert this to a pointer to an int. Then * says to access the pointed-to memory, as if it were an int.
The C standard does not define the behavior of this access, but many C implementations support it, sometimes requiring a command-line switch to enable support for it.
A float value is normally encoded in some way. −62.140625 is not an integer, so it cannot be stored as a binary numeral that represents an integer. It is encoded. Reinterpreting the bytes of memory as an int using * (int *) &bf is an attempt to get the bits into an int so they can be manipulated directly, instead of through floating-point operations.
int ci = bi+(1<<23);
The format most commonly used for the float type is IEEE-754 binary32, also called “single precision.” In this format, bit 31 is a sign bit, bits 30-23 encode an exponent and/or some other information, and bits 22-0 encode most of a significand (or, in the case of a NaN, other information). (The significand is the fraction part of a floating-point representation. A floating-point format represents a number as ±F·b^e, where b is a fixed base, F is a number with a fixed precision in a certain range, and e is an exponent in a certain range. F is the significand.)
1<<23 is 1 shifted 23 bits, so it is 1 in the exponent field, bits 30-23.
If the exponent field contains 1 to 253, then adding 1 to it increases the encoded exponent by 1. (The codes 0 and 255 have special meanings in the exponent field. 254 is a normal value, but adding 1 to it overflows into the special code 255, so it will not increase the exponent in a normal way.)
Since the base b of a binary floating-point format is 2, increasing the exponent by 1 multiplies the number represented by 2: ±F·b^e becomes ±F·b^(e+1).
float cf = *(float *)&ci;
This is the opposite of the previous reinterpretation: it says to reinterpret the bytes of the int as a float.
printf("%X\n",bi);
This says to print bi using a hexadecimal format. This is technically wrong; the %X format should be used with an unsigned int, not an int, but most C implementations let it pass.
printf("%f\n",cf);
This prints the new float value.
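Putting the walkthrough together, here is a strict-aliasing-safe sketch of the doubling trick, assuming IEEE-754 binary32 and 32-bit int; it misbehaves for zeros, subnormals, infinities, NaNs, and when the exponent field overflows:
#include <stdio.h>
#include <string.h>

static float double_via_exponent(float f) {
    int bits;
    memcpy(&bits, &f, sizeof bits);   /* get the encoding into an int */
    bits += 1 << 23;                  /* bump the exponent field by one */
    memcpy(&f, &bits, sizeof f);      /* put the modified encoding back */
    return f;
}

int main(void) {
    printf("%f\n", double_via_exponent(-62.140625f));  /* -124.281250 */
    return 0;
}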

Error on casting unsigned int to float

For the following program:
#include <stdio.h>

int main()
{
    unsigned int a = 10;
    unsigned int b = 20;
    unsigned int c = 30;
    float d = -((a*b)*(c/3));
    printf("d = %f\n", d);
    return 0;
}
It is very strange that the output is
d = 4294965248.000000
When I change the magic number 3 in the expression for d to 3.0, I get the correct result:
d = 2000.000000
If I change the types of a, b, and c to int, I also get the correct result.
I guess this error is caused by the conversion from unsigned int to float, but I do not know the details of how the strange result arises.
Note that you are applying unary minus to an unsigned int before assigning it to the float. If you run the code below, you will very likely get 4294965296:
#include <stdio.h>

int main()
{
    unsigned int a = 10;
    unsigned int b = 20;
    unsigned int c = 30;
    printf("%u", -((a*b)*(c/3)));
    return 0;
}
The -2000 to the right of your equals sign is set up as a signed integer (probably 32 bits in size) and will have the hexadecimal value 0xFFFFF830. The compiler generates code to move this signed integer into your unsigned integer x, which is also a 32-bit entity. The compiler assumes you only have a positive value to the right of the equals sign, so it simply moves all 32 bits into x. x now has the value 0xFFFFF830, which is 4294965296 if interpreted as a positive number. But the printf format %d says the 32 bits are to be interpreted as a signed integer, so you get -2000. If you had used %u it would have printed 4294965296.
#include <stdio.h>

int main()
{
    float d = 4294965296;
    printf("d = %f\n\n", d);
    return 0;
}
When you convert 4294965296 to float, the number has too many significant bits to fit into the fraction part, so some precision is lost. Because of that loss, you get 4294965248.000000, as I did.
The IEEE-754 floating-point standard is a standard for representing and manipulating floating-point quantities that is followed by all modern computer systems.

bit  31   30......23   22.....................0
      S    EEEEEEEE     MMMMMMMMMMMMMMMMMMMMMMM

The bit numbers count from the least-significant bit. The first bit (bit 31) is the sign (0 for positive, 1 for negative). The following 8 bits are the exponent in excess-127 binary notation; this means that the binary pattern 01111111 = 127 represents an exponent of 0, 10000000 = 128 represents 1, 01111110 = 126 represents -1, and so forth. The mantissa fits in the remaining 24 bits, with its leading 1 stripped off as described above.
As you can see, when converting 4294965296 to float, the low 8 bits cannot be represented and are rounded away:
11111111 11111111 11111000 00110000 <-- 4294965296
11111111 11111111 11111000 00000000 <-- 4294965248
This is because you use - on an unsigned int. Unary minus on an unsigned int negates the value modulo 2^32 (in two's-complement terms: invert the bits and add 1). Let's print some unsigned integers:
printf("Positive: %u\n", 2000);
printf("Negative: %u\n", -2000);
// Output:
// Positive: 2000
// Negative: 4294965296
Lets print the hex values:
printf("Positive: %x\n", 2000);
printf("Negative: %x\n", -2000);
// Output
// Positive: 7d0
// Negative: fffff830
As you can see, the value has wrapped around to a huge positive number. So the problem comes from using - on an unsigned int, not from casting unsigned int to float.
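To spell the rule out: for a 32-bit unsigned int, -x computes 2^32 - x, which is the same as ~x + 1. A quick check:
#include <stdio.h>

int main(void) {
    unsigned int x = 2000;
    printf("%x %x\n", -x, ~x + 1);  /* fffff830 fffff830 */
    return 0;
}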
As others have said, the issue is that you are trying to negate an unsigned number. Most of the solutions already given have you do some form of casting to float so that the arithmetic is done on floating-point types. An alternate solution would be to cast the result of your arithmetic to int and then negate; that way the arithmetic operations are done on integral types, which may or may not be preferable, depending on your actual use case:
#include <stdio.h>

int main(void)
{
    unsigned int a = 10;
    unsigned int b = 20;
    unsigned int c = 30;
    float d = -(int)((a*b)*(c/3));
    printf("d = %f\n", d);
    return 0;
}
Your whole calculation is done unsigned, so it is the same as
float d = -(2000u);
-2000 in unsigned int (assuming 32-bit int) is 4294965296.
This gets written into your float d, but as a float cannot store this exact number, it gets saved as 4294965248.
As a rule of thumb you can say that float has a precision of 7 significant base 10 digits.
What is calculated is 2^32 - 2000 and then floating point precision does the rest.
If you instead use 3.0 this changes the types in your calculation as follows
float d = -((a*b)*(c/3.0));
float d = -((unsigned*unsigned)*(unsigned/double));
float d = -((unsigned)*(double));
float d = -(double);
leaving you with the correct negative value.
You need to cast the ints to floats, changing
float d = -((a*b)*(c/3));
to
float d = -(((float)a*(float)b)*((float)c/3.0));
-((a*b)*(c/3)); is all performed in unsigned integer arithmetic, including the unary negation. Unary negation is well-defined for an unsigned type: mathematically the result is reduced modulo 2^N, where N is the number of bits in unsigned int. When you assign that large number to the float, you encounter some loss of precision; due to its binary magnitude, the result is the nearest representable float, which at this size is a multiple of 256 (a float's 24-bit significand cannot hold the low 8 bits of a 32-bit number).
If you change 3 to 3.0, then c / 3.0 has type double, and the result of a * b is therefore converted to double before the multiplication. This double is then assigned to a float, with the precision loss already observed.
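To see the rounding grid directly: just below 2^32 a float's 24-bit significand leaves an 8-bit gap, so representable values are 256 apart. A small sketch (assuming 32-bit unsigned int; nextafterf is from math.h):
#include <math.h>
#include <stdio.h>

int main(void) {
    unsigned int u = 4294965296u;            /* 2^32 - 2000 */
    float f = (float)u;
    printf("%.1f\n", f);                     /* 4294965248.0 */
    printf("%.1f\n", nextafterf(f, 0.0f));   /* 4294964992.0, i.e. 256 below */
    return 0;
}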

Why is 1.0f in C code represented as 1065353216 in the generated assembly?

In C I have this code block:
if (x == 1) {
    a[j][i] = 1;
} else {
    a[j][i] = 0;
}
a is a matrix of float values. If I look at the compiled assembly of this code in NASM syntax,
the assignment a[j][i]=0; was compiled as
dword [rsi+rdi], 0
but the assignment a[j][i]=1; was compiled as
dword [rsi+rdi], 1065353216
How can 1065353216 represent a 1.0f?
Because 1065353216 is the unsigned 32-bit integer representation of the 32-bit floating point value 1.0.
More specifically, 1.0 as a 32-bit float becomes:
0....... ........ ........ ........ sign bit (zero is positive)
.0111111 1....... ........ ........ exponent (127, which means zero)
........ .0000000 00000000 00000000 mantissa (zero, no correction needed)
___________________________________
00111111 10000000 00000000 00000000 result
So the end result is (1 + 0) * 2^0, which is 1.
You can use binaryconvert.com or this useful converter to see other values.
As to why 127 suddenly means zero in the exponent: it's actually a pretty clever trick called exponent bias that makes it easier to compare floating-point values. Try out the converter with wildly different values (10, 100, 1000...) and you'll see the exponent increase as well. Sorting is also the reason the sign bit is the first bit stored.
The float is represented in binary32 format. The positive floats go from 0.0f (whose bits when interpreted as integer represent 0) to +inf (whose bits interpreted as integer represent approximately 2000000000).
The number 1.0f is almost exactly halfway between these two extremes. There are approximately as many positive float numbers below it (10^-1, 10^-2, …) as there are values above it (10^1, 10^2, …). For this reason the value of 1.0f, when its bits are interpreted as an integer, is near 1000000000.
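That ordering property is easy to check: for positive floats, a larger float always has a larger bit pattern when read as an integer. A sketch using memcpy, assuming 32-bit types:
#include <stdio.h>
#include <string.h>

static unsigned int bits_of(float f) {
    unsigned int u;
    memcpy(&u, &f, sizeof u);
    return u;
}

int main(void) {
    /* strictly increasing: 1056964608 1065353216 1073741824 */
    printf("%u %u %u\n", bits_of(0.5f), bits_of(1.0f), bits_of(2.0f));
    return 0;
}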
You can see the binary representation of the floating number 1.0 with the following lines of code:
#include <stdio.h>

int main(void) {
    float a = 1.0;
    printf("in hex, this is %08x\n", *((int*)(&a)));
    printf("the int representation is %d\n", *((int*)(&a)));
    return 0;
}
This results in
in hex, this is 3f800000
the int representation is 1065353216
The format of a 32 bit floating point number is given by
1 sign bit (s) = 0
8 exponent bits (e) = 7F = 127
23 mantissa bits (m) = 0
You add an (implied) 1 in front of the mantissa - in the above case the mantissa is all zeros, and the implied value is
1000 0000 0000 0000 0000 0000
This is 2^23 or 8388608. Now you multiply by (-1)^sign - which is 1 in this case.
Finally, you multiply by 2^(exponent-150). Really, you should express the mantissa as a fraction (1.0000000) and multiply by 2^(exponent-127), but that's the same thing. Either way, the result is 1.0
That should clear it up for you.
UPDATE: it was pointed out in the comments that my code example may invoke undefined behavior, although my gcc compiler generated no warnings or errors. The code below is a more correct way to show that 1.0 is 1065353216 as an int (for 32-bit int and float):
#include <stdio.h>

union {
    int i;
    float a;
} either;

int main(void) {
    either.a = 1.0;
    printf("in hex, this is %08x\n", either.i);
    printf("the int representation is %d\n", either.i);
    return 0;
}

Bit shifting for fixed point arithmetic on float numbers in C

I wrote the following test code to check fixed-point arithmetic and bit shifting.
#include <stdio.h>

int main(void) {
    float x = 2;
    float y = 3;
    float z = 1;
    unsigned int *px = (unsigned int *)(&x);
    unsigned int *py = (unsigned int *)(&y);
    unsigned int *pz = (unsigned int *)(&z);
    *px <<= 1;
    *py <<= 1;
    *pz <<= 1;
    *pz = *px + *py;
    *px >>= 1;
    *py >>= 1;
    *pz >>= 1;
    printf("%f %f %f\n", x, y, z);
}
The result is
2.000000 3.000000 0.000000
Why is the last number 0? I was expecting to see 5.000000.
I want to use some kind of fixed point arithmetic to bypass the use of floating point numbers on an image processing application. Which is the best/easiest/most efficient way to turn my floating point arrays into integers? Is the above "tricking the compiler" a robust workaround? Any suggestions?
If you want to use fixed point, don't use the types float or double, because they have internal structure. Floats and doubles have a sign bit, some exponent bits, and some mantissa bits; they are inherently floating point.
You should either implement fixed point by hand, storing the data in an integer type, or use a fixed-point library (or language extension).
There is a description of Floating point extensions implemented in GCC: http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html
There is some MACRO-based manual implementation of fixed-point for C: http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
What you are doing is cruelty to the numbers.
First, you assign values to float variables. How they are stored is system dependent, but normally IEEE 754 format is used. So your variables internally look like:
x = 2.0 = 1 * 2^1 : sign = 0, mantissa = 1, exponent = 1 -> 0 10000000 00000000000000000000000 = 0x40000000
y = 3.0 = 1.5 * 2^1 : sign = 0, mantissa = 1.5, exponent = 1 -> 0 10000000 10000000000000000000000 = 0x40400000
z = 1.0 = 1 * 2^0 : sign = 0, mantissa = 1, exponent = 0 -> 0 01111111 00000000000000000000000 = 0x3F800000
If you do bit-shifting operations on these numbers, you mix up the borders between sign, exponent, and mantissa, and so anything can, may, and will happen.
In your case:
your 2.0 becomes 0x80000000, resulting in -0.0,
your 3.0 becomes 0x80800000, resulting in -1.1754943508222875e-38,
your 1.0 becomes 0x7F000000, resulting in 1.7014118346046923e+38.
The sum in z is computed on the integer bit patterns, not on floats: 0x80000000 + 0x80800000 wraps around modulo 2^32 to 0x00800000. Shifting that right by 1 gives 0x00400000, which is a subnormal number of about 5.9e-39 and prints as 0.000000.
The takeaway is that you cannot do bit-shifting on floats and expect a reliable result.
I would consider converting them to integer or other fixed-point on the ARM and sending them over the line as they are.
It's probable that your compiler uses IEEE 754 format for floats, which in bit terms, looks like this:
SEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFF
^ bit 31 ^ bit 0
S is the sign bit; s = 1 implies the number is negative.
E bits are the exponent. There are 8 exponent bits giving a range of 0 - 255 but the exponent is biased - you need to subtract 127 to get the true exponent.
F bits are the fraction part, however, you need to imagine an invisible 1 on the front so the fraction is always 1.something and all you see are the binary fraction digits.
The number 2 is 1 x 2^1 = 1 x 2^(128 - 127), so it is encoded as
01000000000000000000000000000000
So if you use a bit shift to shift it left you get
10000000000000000000000000000000
which by convention is -0 in IEEE 754, so rather than multiplying your number by 2, your shift has made it zero.
The number 3 is [1 + 0.5] x 2^(128 - 127)
which is represented as
01000000010000000000000000000000
Shifting that left gives you
10000000100000000000000000000000
which is -1 x 2^-126, or some very small number.
You can do the same for z, but you probably get the idea that shifting just screws up floating point numbers.
Fixed point doesn't work that way. What you want to do is something like this:
#include <stdio.h>

int main(void) {
    // initializing 8-bit fixed-point numbers
    unsigned int x = 2 << 8;
    unsigned int y = 3 << 8;
    unsigned int z = 1 << 8;
    // adding two numbers
    unsigned int a = x + y;
    // multiplying two numbers with fixed-point adjustment
    unsigned int b = (x * y) >> 8;
    // use the numbers
    printf("%u %u\n", a >> 8, b >> 8);
}
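The same scheme extends to fractional values: with 8 fraction bits the resolution is 1/256, so 2.5 is stored as 2.5 * 256 = 640. A minimal sketch of that idea:
#include <stdio.h>

int main(void) {
    unsigned int x = (unsigned int)(2.5 * 256);    /* 640 */
    unsigned int y = (unsigned int)(1.25 * 256);   /* 320 */
    unsigned int sum = x + y;                      /* addition needs no rescaling */
    unsigned int prod = (x * y) >> 8;              /* rescale after multiplying */
    printf("sum  = %u.%02u\n", sum >> 8, (sum & 0xFF) * 100 / 256);    /* 3.75 */
    printf("prod = %u.%02u\n", prod >> 8, (prod & 0xFF) * 100 / 256);  /* 3.12 */
    return 0;
}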
