In C, I am trying to give a floating point variable a bit pattern written as binary bits (or hexadecimal digits) and then print it out. However, it doesn't print the number I have calculated by hand or with an online converter.
float x = (float) 0b01000001110010000000000000000000;
or
float x = (float) 0x41C80000;
When printed out using
printf("%f", x);
produces results like this:
1103626240.000000
Instead of the expected 25, which follows from a sign bit of 0, a biased exponent of 131, and a significand of 1.5625.
Why is this, and how can I get the results I want?
The hex value 0x41C80000 is an integer constant with the decimal value 1103626240. In your code, you are casting this value to a float, which converts the integer's numeric value rather than reinterpreting its bits, and gives you this result:
x = 1103626240.000000
A solution for this can be made using a union:
union uint_to_float {
    unsigned int u;
    float f;
};

union uint_to_float u2f;
u2f.u = 0x41C80000;           /* store the bit pattern as an integer */
printf("x = %f\n", u2f.f);    /* read the same bits back as a float */
EDIT:
As mentioned by @chux, using uint32_t from <stdint.h> instead of unsigned int is a better solution, since uint32_t is guaranteed to be exactly 32 bits wide.
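A minimal runnable sketch of that fixed-width variant (assuming float is a 32-bit IEEE-754 type on your platform; the union tag u32_to_float is just an illustrative name):

#include <stdint.h>
#include <stdio.h>

union u32_to_float {
    uint32_t u;    /* exactly 32 bits, unlike unsigned int */
    float f;
};

int main(void) {
    union u32_to_float u2f;
    u2f.u = 0x41C80000;          /* bit pattern of 25.0f */
    printf("x = %f\n", u2f.f);   /* prints x = 25.000000 */
    return 0;
}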
Related
How can I get a float or real value from integer division? For example:
double result = 30/233;
yields zero. I'd like the value with decimal places.
How can I then format so only two decimal places display when used with a string?
You could just add a decimal point to either the numerator or the denominator:
double result = 30.0 / 233;
double result = 30 / 233.0;
Typecasting either of the two numbers also works.
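For instance, a cast on either operand forces the division to happen in floating point:

double result = (double)30 / 233;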
As for the second part of the question, if you use printf-style format strings, you can do something like this:
sprintf(str, "result = %.2f", result);
Basically, the ".2" specifies how many digits to print after the decimal point.
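Putting the two together, a small runnable sketch (30.0/233 is about 0.1288, which rounds to 0.13 at two decimal places):

#include <stdio.h>

int main(void) {
    double result = 30.0 / 233;           /* about 0.1288 */
    printf("result = %.2f\n", result);    /* prints: result = 0.13 */
    return 0;
}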
If you have an integer (not integer constant):
int i = 20;
int j = 220;
double d = i/(double)j;
This is the simplest way to do what you are trying to achieve, I think.
double result = 30/233.0f;
For iOS development (iPhone/iPad/etc.) it is better to use the float type:
float result = 30/233.0f;
I have a very basic question. In my program, I am multiplying two fixed-point numbers, as shown below. My inputs are in Q1.31 format and the output should be in the same format. To do this, I store the result of the multiplication in a temporary 64-bit variable and then do some operations to get the result into the required format.
#include <stdio.h>

int conversion1(float input, int Q_FORMAT)
{
    return ((int)(input * ((1 << Q_FORMAT) - 1)));
}

int mul(int input1, int input2, int format)
{
    __int64 result;
    result = (__int64)input1 * (__int64)input2; //Q2.62 format
    result = result << 1;                       //Q1.63 format
    result = result >> (format + 1);            //Q33.31 format
    return (int)result;                         //Q1.31 format
}

int main()
{
    int Q_FORMAT = 31;
    float input1 = 0.5, input2 = 0.5;
    int q_input1, q_input2;
    int temp_mul;
    float q_muls;

    q_input1 = conversion1(input1, Q_FORMAT);
    q_input2 = conversion1(input2, Q_FORMAT);
    temp_mul = mul(q_input1, q_input2, Q_FORMAT); //was missing: perform the multiplication
    q_muls = ((float)temp_mul / ((1 << (Q_FORMAT)) - 1));
    printf("result of multiplication using q format = %f\n", q_muls);
    return 0;
}
My question is: while converting the float input to an integer (and also while converting the int output back to a float), I am scaling by (1<<Q_FORMAT)-1. But I have seen people use (1<<Q_FORMAT) directly in their code. The problem I face when using (1<<Q_FORMAT) is that I get the negative of the desired result.

For example, in my program:

If I use (1<<Q_FORMAT), I get -0.25 as the result.

But if I use (1<<Q_FORMAT)-1, I get 0.25 as the result, which is correct.

Where am I going wrong? Do I need to understand any other concepts?
On common platforms, int is a two's complement 32-bit integer providing 31 digits (plus a 'sign' bit). That is a bit too narrow to represent a Q1.31 number, which requires 32 digits (plus a 'sign' bit).
In your example, this is manifesting as effective arithmetic overflow in the expression, 1 << Q_FORMAT.
To avoid this, you need to either use a type providing more digits (e.g. long long) or a fixed-point format requiring fewer digits (e.g. Q1.30). You can use unsigned to fix your example but the result will be a 'sign' bit short of Q2.30.
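A minimal sketch of the wider-type route, assuming <stdint.h> is available; the helper names float_to_q31 and q31_to_float are illustrative, not from the question:

#include <stdint.h>

/* Use a 64-bit 1 so that shifting by 31 cannot overflow a 32-bit int.
   Inputs must lie in [-1.0, 1.0) for the result to fit in Q1.31. */
int32_t float_to_q31(float x)
{
    return (int32_t)(x * (float)(1LL << 31));   /* scale by 2^31 */
}

float q31_to_float(int32_t q)
{
    return (float)q / (float)(1LL << 31);       /* unscale by 2^31 */
}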
I understand there are several topics the same as mine, but I still don't really get it, so I'm hoping someone could explain this in a simple but explicit way rather than pasting links to other topics. Thanks.
Here's a sample code:
int a = 960;
int b = 16;
float c = a*0.001;
float d = a*0.001 + b;
double e = a*0.001 + b;
printf("%f\n%f\n%lf", c, d, e);
which outputs:
0.960000
16.959999
16.960000
My two questions are:
Why does adding an integer to a float end up as the second output, while changing float to double fixes the problem, as in the third output?
Why does the third output have the same number of digits after the decimal point as the first and second outputs, given that it should be a more precise value?
The reason they produce the same number of decimal places is that 6 is the default. You can change that, as in the edited example below: the precision can either be written directly into the format string (as in %.9f), or given as %.*f with the precision supplied as a separate argument (as in the second printf).
#include <stdio.h>

int main(void) {
    int a = 960;
    int b = 16;
    float c = a*0.001;
    float d = a*0.001 + b;
    double e = a*0.001 + b;
    printf("%.9f\n", c);
    printf("%.*f\n", 9, d);
    printf("%.16f\n", e);
}
Program output:
0.959999979
16.959999084
16.9600000000000009
The extra decimal places now show that none of the results is exact. One reason is that 0.001 cannot be exactly encoded as a binary floating point value. There are other reasons too, which have been covered extensively elsewhere.
One easy way to understand why: a float has only about 2^32 distinct values that can be encoded, yet there are infinitely many real numbers within the range of a float, so only about 2^32 of them can be represented exactly. The fraction 1/1000 happens to be a recurring value in binary (just as the fraction 1/3 is in decimal).
I think the calculation a*0.001 will be done in double precision in both cases, then some precision is lost when you store it as a float.
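A small sketch of that effect; the printed values assume IEEE-754 single and double precision:

#include <stdio.h>

int main(void) {
    /* 0.001 is a double constant, so the arithmetic happens in double */
    double kept     = 960 * 0.001 + 16;
    float  narrowed = 960 * 0.001 + 16;  /* double result rounded to float */
    printf("double: %.9f\n", kept);      /* prints 16.960000000 */
    printf("float : %.9f\n", narrowed);  /* prints 16.959999084 */
    return 0;
}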
You can choose how many decimal digits are printed by printf by writing e.g. "%.10lf" (to get 10 digits) instead of just "%lf".
I cannot figure out how the value printed through a float pointer comes about when the pointer was made by casting an integer's address to a float pointer. I'm sorry if I'm wording this incorrectly. Here is an example of what I mean:
#include <stdio.h>

int main(void) {
    int i;
    float *f;
    i = 1092616192;
    f = (float *)&i;
    printf("i is %d and f is %f\n", i, *f);
}
The output for f is 10. How did I get that result?
The value 1092616192 in hexadecimal is 0x41200000.
Interpreted as an IEEE-754 single-precision float, that gives you:
sign = positive (bit value 0)
exponent = 130, i.e. 2^3 after removing the bias (10000010b)
significand = 2097152, i.e. 1.25 (01000000000000000000000b)
2^3 * 1.25
= 8 * 1.25
= 10
To explain the exponent: it uses an offset (biased) encoding, so you have to subtract 127 from the stored value to get the real exponent: 130 - 127 = 3. And since this is a binary encoding, the base is 2: 2^3 = 8.
To explain the significand: you start with an implicit 'whole' value of 1. The uppermost (leftmost) fraction bit is worth half of that, 0.5. The next bit is worth half of 0.5, i.e. 0.25, and so on. Because only the 0.25 bit and the implicit 1 are set, the significand represents 1 + 0.25 = 1.25.
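That arithmetic can be reproduced in code. A small decoder sketch, assuming an IEEE-754 single-precision float, a normal (non-denormal) value, and an exponent of at least 127 so the shift below is valid:

#include <stdio.h>

int main(void) {
    unsigned int bits = 1092616192u;              /* 0x41200000 */
    unsigned int sign     = bits >> 31;           /* 0 -> positive */
    unsigned int exponent = (bits >> 23) & 0xFF;  /* 130 */
    unsigned int fraction = bits & 0x7FFFFF;      /* 2097152 */
    /* value = (-1)^sign * (1 + fraction/2^23) * 2^(exponent - 127) */
    double value = (sign ? -1.0 : 1.0)
                 * (1.0 + fraction / 8388608.0)   /* 1 + 0.25 = 1.25 */
                 * (1 << (exponent - 127));       /* 2^3 = 8 */
    printf("decoded: %f\n", value);               /* prints decoded: 10.000000 */
    return 0;
}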
What you are trying to do is called type punning. It should be done via a union or with memcpy(), and it is only meaningful on an architecture where sizeof(int) == sizeof(float) and there are no padding bits. The result is highly dependent on the architecture: byte ordering and the floating point representation both affect the reinterpreted value. The presence of padding bits would invoke undefined behavior, since the representation of a float could be a trap value for type int.
Here is how you get the int whose bits correspond to a different value, say 15.0:
#include <stdio.h>

int main(void) {
    union {
        float f;
        int i;
        unsigned int u;
    } u;
    u.f = 15;
    printf("re-interpreting the bits of float %.1f as int gives %d (%#x in hex)\n",
           u.f, u.i, u.u);
    return 0;
}
output on an Intel PC:
re-interpreting the bits of float 15.0 as int gives 1097859072 (0x41700000 in hex)
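And a sketch of the memcpy() route mentioned above, applied to the question's value (again assuming sizeof(int) == sizeof(float) and no padding bits):

#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 10.0f;              /* the question's float value */
    int i;
    memcpy(&i, &f, sizeof i);     /* copy the raw bytes into the int */
    printf("float %f has the bits of int %d (%#x in hex)\n",
           f, i, (unsigned)i);    /* 1092616192 (0x41200000) on an Intel PC */
    return 0;
}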
You are trying to predict the consequence of undefined behavior; it depends on many things, including the hardware and OS you are using.
Basically, what you are doing is throwing a glass against the wall and picking up a certain shard. Now you are asking how to get a differently shaped shard. Well, you would need to throw the glass against the wall differently...
In C programming, I found a weird problem that counters my intuition. When I declare an integer as INT_MAX (2147483647, defined in limits.h) and implicitly convert it to a float value, it appears to work fine, i.e., the float value seems to be the same as the maximum integer. Then, when I convert the float back to an integer, something interesting happens: the new integer becomes the minimum integer (-2147483648).
The source codes look as below:
int a = INT_MAX;
float b = a; // b is correct
int a_new = b; // a_new becomes INT_MIN
I am not sure what happens when the float number b is converted to the integer a_new. So, is there a reasonable way to find the maximum value that can be converted back and forth between the integer and float types?
PS: The value of INT_MAX - 100 works fine, but this is just an arbitrary workaround.
This answer assumes that float is an IEEE-754 single precision float encoded in 32 bits, and that an int is 32 bits. See the Wikipedia article on IEEE-754 for more information.
Floating point numbers have only 24 bits of precision, compared with 32 bits for an int. Therefore int values from 0 to 16777215 have an exact representation as floating point numbers, but numbers greater than 16777215 do not necessarily have exact representations as floats. The following code demonstrates this fact (on systems that use IEEE-754).
for ( int a = 16777210; a < 16777224; a++ )
{
    float b = a;
    int c = b;
    printf( "a=%d c=%d b=0x%08x\n", a, c, *((int*)&b) );
}
The expected output is
a=16777210 c=16777210 b=0x4b7ffffa
a=16777211 c=16777211 b=0x4b7ffffb
a=16777212 c=16777212 b=0x4b7ffffc
a=16777213 c=16777213 b=0x4b7ffffd
a=16777214 c=16777214 b=0x4b7ffffe
a=16777215 c=16777215 b=0x4b7fffff
a=16777216 c=16777216 b=0x4b800000
a=16777217 c=16777216 b=0x4b800000
a=16777218 c=16777218 b=0x4b800001
a=16777219 c=16777220 b=0x4b800002
a=16777220 c=16777220 b=0x4b800002
a=16777221 c=16777220 b=0x4b800002
a=16777222 c=16777222 b=0x4b800003
a=16777223 c=16777224 b=0x4b800004
Of interest here is that the float value 0x4b800002 is used to represent the three int values 16777219, 16777220, and 16777221, and thus converting 16777219 to a float and back to an int does not preserve the exact value of the int.
The two floating point values that are closest to INT_MAX are 2147483520 and 2147483648, which can be demonstrated with this code
for ( int a = 2147483520; a < 2147483647; a++ )
{
    float b = a;
    int c = b;
    printf( "a=%d c=%d b=0x%08x\n", a, c, *((int*)&b) );
}
The interesting parts of the output are
a=2147483520 c=2147483520 b=0x4effffff
a=2147483521 c=2147483520 b=0x4effffff
...
a=2147483582 c=2147483520 b=0x4effffff
a=2147483583 c=2147483520 b=0x4effffff
a=2147483584 c=-2147483648 b=0x4f000000
a=2147483585 c=-2147483648 b=0x4f000000
...
a=2147483645 c=-2147483648 b=0x4f000000
a=2147483646 c=-2147483648 b=0x4f000000
Note that all 32-bit int values from 2147483584 to 2147483647 will be rounded up to a float value of 2147483648. The largest int value that will round down is 2147483583, which is the same as (INT_MAX - 64) on a 32-bit system.
One might conclude therefore that numbers below (INT_MAX - 64) will safely convert from int to float and back to int. But that is only true on systems where the size of an int is 32-bits, and a float is encoded per IEEE-754.
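To address the round-trip part of the question directly, a brute-force sketch like the following finds the answer at run time instead of hard-coding IEEE-754 limits. On the system described above it prints 2147483520, the largest int that is exactly representable as a float and therefore survives the int -> float -> int round trip unchanged:

#include <limits.h>
#include <stdio.h>

int main(void) {
    for (int a = INT_MAX; a > 0; a--) {
        float b = (float)a;
        /* skip values whose float rounds up out of int range,
           where converting back to int would be undefined */
        if (b < (float)INT_MAX && (int)b == a) {
            printf("largest round-trippable int: %d\n", a);
            break;
        }
    }
    return 0;
}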