May I assume that (int)(float)n == n for any int n? At least I need this for non-negative, 31 bits values.
addendum. What about (int)(double)n ==n?
No, you can't. For ints that can't be represented exactly by a float, this will fail.
(The reason: float is generally a 32-bit IEEE-754 floating-point value. It only has 24 bits of precision (23 stored explicitly plus one implicit leading bit); the rest is used for the exponent and the sign. So if your integer has more than 24 significant binary digits, and it doesn't happen to be a multiple of an appropriate power of two, then it can't be represented exactly as a float.)
addendum. What about (int)(double)n ==n?
It's the same in principle: for ints that can't be represented exactly as a double, the comparison won't yield true. In practice, however, int is generally not wide enough for this to happen: the widely used implementation of double is a 64-bit IEEE-754 floating-point number with 53 bits of precision, whereas ints tend to be at most 32 bits long. But you can always repeat the experiment with a long or a long long and a double instead.
Here's a demo:
#include <stdio.h>
int main()
{
int n = (1 << 24) + 1;
printf("n == n: %d\n" , n == n);
printf("n == (int)(float)n: %d\n", n == (int)(float)n);
return 0;
}
This prints:
n == n: 1
n == (int)(float)n: 0
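To test individual values, a minimal sketch along these lines may help. The helper names are made up for illustration, and the code assumes that long long is wider than int. The comparison is done in long long because (float)INT_MAX rounds up to 2^31 on typical platforms, and converting that back to int would itself be undefined.

```c
#include <limits.h>

/* Returns 1 if n survives int -> float -> integer unchanged.
   Assumption: long long is wider than int, so the float value
   always fits when it is converted back. */
int roundtrips_via_float(int n)
{
    float f = (float)n;
    return (long long)f == (long long)n;
}

/* Same check through double; with a 32-bit int and a 53-bit
   significand this is expected to succeed for every n. */
int roundtrips_via_double(int n)
{
    double d = (double)n;
    return (long long)d == (long long)n;
}
```

With a 32-bit int, roundtrips_via_float() should report failure for (1 << 24) + 1 and for INT_MAX, while roundtrips_via_double() should succeed for both.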
I am looking to check if a double value can be represented as an int (or the same for any pair of floating-point and integer types). This is a simple way to do it:
double x = ...;
int i = x; // potentially undefined behaviour
if ((double) i != x) {
// not representable
}
However, it invokes undefined behaviour on the marked line, and triggers UBSan (which some will complain about).
Questions:
Is this method considered acceptable in general?
Is there a reasonably simple way to do it without invoking undefined behaviour?
Clarifications, as requested:
The situation I am facing right now involves conversion from double to various integer types (int, long, long long) in C. However, I have encountered similar situations before, thus I am interested in answers both for float -> integer and integer -> float conversions.
Examples of how the conversion may fail:
Float -> integer conversion may fail if the value is not a whole number, e.g. 3.5.
The source value may be out of the range of the target type (larger or smaller than the max and min representable values). For example 1.23e100.
The source values may be +-Inf or NaN, NaN being tricky as any comparison with it returns false.
Integer -> float conversion may fail when the float type does not have enough precision. For example, a typical double has 53 bits of precision, compared to 63 value bits in a 64-bit signed integer type. For example, on a typical 64-bit system, (long) (double) ((1L << 53) + 1L) does not give back the original value.
I do understand that 1L << 53 (as opposed to (1L << 53) + 1) is technically exactly representable as a double, and that the code I proposed would accept this conversion, even though it probably shouldn't be allowed.
Anything I didn't think of?
Create range limits exactly as FP types
The "trick" is to form the limits without losing precision.
Let us consider float to int.
Conversion of float to int is valid (for example with 32-bit 2's complement int) for -2,147,483,648.9999... to 2,147,483,647.9999... or nearly INT_MIN -1 to INT_MAX + 1.
We can take advantage that integer_MAX is always a power-of-2 - 1 and integer_MIN is -(power-of-2) (for common 2's complement).
Avoid the limit of FP_INT_MIN_minus_1 as it may/may not be exactly encodable as a FP.
// Form FP limits of "INT_MAX plus 1" and "INT_MIN"
#define FLOAT_INT_MAX_P1 ((INT_MAX/2 + 1)*2.0f)
#define FLOAT_INT_MIN ((float) INT_MIN)
if (f < FLOAT_INT_MAX_P1 && f - FLOAT_INT_MIN > -1.0f) {
// Within range.
    // Use modff() to detect a fraction if desired.
}
More pedantic code would use !isnan(f) and consider non-2's complement encoding.
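Putting the pieces above together, a checked conversion might look like the following sketch. The function name float_to_int_exact is hypothetical, and the code assumes a 2's complement int; it uses the same limits as above plus modff() to reject fractional values.

```c
#include <limits.h>
#include <math.h>
#include <stdbool.h>

/* "INT_MAX plus 1" formed without losing precision (a power of two). */
#define FLOAT_INT_MAX_P1 ((INT_MAX / 2 + 1) * 2.0f)

/* Hypothetical helper: stores f in *out and returns true only when f is
   a whole number inside the range of int. Assumes 2's complement int. */
bool float_to_int_exact(float f, int *out)
{
    float ipart;

    if (isnan(f))
        return false;
    if (!(f < FLOAT_INT_MAX_P1 && f - (float)INT_MIN > -1.0f))
        return false;                 /* out of range, including +-Inf */
    if (modff(f, &ipart) != 0.0f)
        return false;                 /* has a fractional part */
    *out = (int)ipart;
    return true;
}
```

The range test comes before the cast, so the conversion on the last line is only reached for values that int can hold.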
Using known limits and floating-point number validity. Check what's inside limits.h header.
You can write something like this:
#include <limits.h>
#include <math.h>
// Of course, the constants used are specific to the "int" type... There are others for other types.
// isfinite() rejects NaN and +-Inf; isnormal() would also reject 0.0 and subnormals.
if ((isfinite(x)) && (x>=INT_MIN) && (x<=INT_MAX) && (round(x)==x))
// Safe assignment from double to int.
i = (int)x ;
else
// Handle error/overflow here.
ERROR(.....) ;
Code relies on lazy boolean evaluation, obviously.
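For wider targets the "x <= TYPE_MAX" comparison needs care: with a 64-bit long long, LLONG_MAX is not exactly representable as a double and is typically rounded up to 2^63, so the test would accept 2^63 itself. A possible sketch for that case, assuming a 64-bit 2's complement long long (the function name is made up for illustration), compares against powers of two instead:

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: exact double -> long long conversion.
   Assumes long long is a 64-bit 2's complement type. */
bool double_to_llong_exact(double x, long long *out)
{
    const double two63 = 9223372036854775808.0;  /* 2^63, exact as a double */
    long long v;

    if (!isfinite(x))
        return false;                 /* NaN or +-Inf */
    if (x < -two63 || x >= two63)
        return false;                 /* outside [LLONG_MIN, LLONG_MAX] */
    v = (long long)x;                 /* in range: truncates toward zero */
    if ((double)v != x)
        return false;                 /* x had a fractional part */
    *out = v;
    return true;
}
```

The fraction check is done by converting back, which is safe here because the value has already been confirmed to be in range.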
Please refer to IEEE 754 representation of floating point numbers in Memory
https://en.wikipedia.org/wiki/IEEE_754
Take double as an example:
Sign bit: 1 bit
Exponent: 11 bits
Fraction: 52 bits
There are three special values to point out here:
If the exponent is 0 and the fractional part of the mantissa is 0, the number is ±0
If the exponent is 2047 and the fractional part of the mantissa is 0, the number is ±∞
If the exponent is 2047 and the fractional part of the mantissa is non-zero, the number is NaN.
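The special values listed above can be recognised directly from the bit fields. The following is only a sketch, assuming that double is IEEE-754 binary64 and has the same byte order as uint64_t; the enum and function names are invented for this example.

```c
#include <stdint.h>
#include <string.h>

typedef enum { DK_ZERO, DK_SUBNORMAL, DK_NORMAL, DK_INFINITE, DK_NAN } double_kind;

/* Classify a double by looking at its exponent and fraction fields.
   Assumes IEEE-754 binary64 with the same endianness as uint64_t. */
double_kind classify_double(double d)
{
    uint64_t bits, fraction;
    unsigned exponent;

    memcpy(&bits, &d, sizeof bits);               /* well-defined bit copy */
    exponent = (unsigned)((bits >> 52) & 0x7FFu); /* 11 exponent bits */
    fraction = bits & ((UINT64_C(1) << 52) - 1);  /* 52 fraction bits */

    if (exponent == 0)
        return fraction == 0 ? DK_ZERO : DK_SUBNORMAL;
    if (exponent == 2047)
        return fraction == 0 ? DK_INFINITE : DK_NAN;
    return DK_NORMAL;
}
```

In real code the standard fpclassify() macro from math.h does the same job portably; the sketch only illustrates the encoding.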
This is an example of converting from double to int on a 64-bit system, just for reference
#include <stdint.h>
#define EXPBITS 11
#define FRACTIONBITS 52
#define GENMASK(n) (((uint64_t)1 << (n)) - 1)
#define EXPBIAS GENMASK(EXPBITS - 1)
#define SIGNMASK (~GENMASK(FRACTIONBITS + EXPBITS))
#define EXPMASK (GENMASK(EXPBITS) << FRACTIONBITS)
#define FRACTIONMASK GENMASK(FRACTIONBITS)
int double_to_int(double src, int *dst)
{
union {
double d;
uint64_t i;
} y;
int exp;
int sign;
int maxbits;
uint64_t fraction;
y.d = src;
sign = (y.i & SIGNMASK) ? 1 : 0;
exp = (y.i & EXPMASK) >> FRACTIONBITS;
fraction = (y.i & FRACTIONMASK);
// 0
if (fraction == 0 && exp == 0) {
*dst = 0;
return 0;
}
exp -= EXPBIAS;
// not a whole number
if (exp < 0)
return -1;
// out of the range of int
maxbits = sizeof(*dst) * 8 - 1;
if (exp >= maxbits && !(exp == maxbits && sign && fraction == 0))
return -2;
// not a whole number
if (fraction & GENMASK(FRACTIONBITS - exp))
return -3;
// convert to int
*dst = src;
return 0;
}
I'm trying to find out what this program prints exactly.
#include <stdio.h>
int main() {
float bf = -62.140625;
int bi = *(int *)&bf;
int ci = bi+(1<<23);
float cf = *(float *)&ci;
printf("%X\n",bi);
printf("%f\n",cf);
}
This prints out:
C2789000
-124.281250
But what happens line by line ? I do not understand .
Thanks in advance.
It is a convoluted way of doubling a 32-bit floating-point number by adding one to its exponent. Moreover, it is incorrect because it violates the strict aliasing rule by accessing an object of type float via type int.
The exponent is located at bits 23 to 30. Adding 1<<23 increments the exponent by one, which works like multiplying the original number by 2.
If we rewrite this program to remove pointer punning
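#include <stdio.h>
#include <string.h>
int main() {
    float bf = -62.140625;
    int bi;
    memcpy(&bi, &bf, sizeof(bi)); // copy the bits instead of casting pointers
    for(int i = 0; i < 32; i += 8)
        printf("%02x ", ((unsigned)bi >> i) & 0xffu); // bytes, least significant first
    printf("\n");
    bi += (1<<23); // add 1 to the exponent field
    memcpy(&bf, &bi, sizeof(bi));
    printf("%f\n",bf);
}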
Float numbers have the format: 1 sign bit (bit 31), 8 exponent bits (bits 30-23, biased by 127) and 23 fraction bits (bits 22-0).
-62.140625 is -1.94189453125 * 2^5, so its exponent field holds 5 + 127 = 132.
bi += (1<<23);
adds 1 to the exponent field, so the resulting float number will be -62.140625 * 2^1 and it is equal to -124.281250. If you change that line to
bi += (1<<24);
it will add 2 to the exponent field, so the resulting float number will be -62.140625 * 2^2 and it is equal to -248.562500.
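The same idea can be packaged as a small function. This is a sketch under the assumption that float is IEEE-754 binary32; the names float_bits, biased_exponent and double_via_exponent are made up here. It only touches the exponent field when the result stays a normal number and otherwise falls back to an ordinary multiplication.

```c
#include <stdint.h>
#include <string.h>

/* Bit pattern of a float, copied with memcpy to avoid aliasing problems. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* The raw 8-bit exponent field (bits 30-23), still including the bias. */
unsigned biased_exponent(float x)
{
    return (unsigned)((float_bits(x) >> 23) & 0xFFu);
}

/* Double x by adding 1 to the exponent field when that is safe. */
float double_via_exponent(float x)
{
    uint32_t bits = float_bits(x);
    unsigned e = (unsigned)((bits >> 23) & 0xFFu);

    if (e == 0 || e >= 254)
        return x * 2.0f;   /* zero, subnormal, near overflow, Inf or NaN */
    bits += UINT32_C(1) << 23;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For -62.140625f, float_bits() gives 0xC2789000 and biased_exponent() gives 132, matching the hexadecimal output printed by the original program.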
float bf = -62.140625;
This creates a float object named bf and initializes it to −62.140625.
int bi = *(int *)&bf;
&bf takes the address of bf, which produces a pointer to a float. (int *) says to convert this to a pointer to an int. Then * says to access the pointed-to memory, as if it were an int.
The C standard does not define the behavior of this access, but many C implementations support it, sometimes requiring a command-line switch to enable support for it.
A float value is normally encoded in some way. −62.140625 is not an integer, so it cannot be stored as a binary numeral that represents an integer. It is encoded. Reinterpreting the bytes of memory as an int using * (int *) &bf is an attempt to get the bits into an int so they can be manipulated directly, instead of through floating-point operations.
int ci = bi+(1<<23);
The format most commonly used for the float type is IEEE-754 binary32, also called “single precision.” In this format, bit 31 is a sign bit, bits 30-23 encode an exponent and/or some other information, and bits 22-0 encode most of a significand (or, in the case of a NaN, other information). (The significand is the fraction part of a floating-point representation. A floating-point format represents a number as ±F•b^e, where b is a fixed base, F is a number with a fixed precision in a certain range, and e is an exponent in a certain range. F is the significand.)
1<<23 is 1 shifted 23 bits, so it is 1 in the exponent field, bits 30-23.
If the exponent field contains 1 to 253, then adding 1 to it increases the encoded exponent by 1. (The codes 0 and 255 have special meaning in the exponent field. 254 is a normal value, but adding 1 to it overflows the exponent into the special code 255, so it will not increase the exponent in a normal way.)
Since the base b of a binary floating-point format is 2, increasing the exponent by 1 multiplies the number represented by 2. ±F•b^e becomes ±F•b^(e+1).
float cf = *(float *)&ci;
This is the opposite of the previous reinterpretation: It says to reinterpret the bytes of the int as a float.
printf("%X\n",bi);
This says to print bi using a hexadecimal format. This is technically wrong; the %X format should be used with an unsigned int, not an int, but most C implementations let it pass.
printf("%f\n",cf);
This prints the new float value.
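To make the three fields concrete, here is a sketch that splits a float into sign, exponent field and fraction, and rebuilds the value from them. It assumes IEEE-754 binary32, the struct and function names are invented for the example, and the rebuild only covers normal numbers (exponent field 1 to 254).

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

struct binary32_fields {
    unsigned sign;      /* bit 31 */
    unsigned exponent;  /* bits 30-23, biased by 127 */
    uint32_t fraction;  /* bits 22-0 */
};

struct binary32_fields split_binary32(float x)
{
    struct binary32_fields f;
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);
    f.sign = (unsigned)(bits >> 31);
    f.exponent = (unsigned)((bits >> 23) & 0xFFu);
    f.fraction = bits & UINT32_C(0x7FFFFF);
    return f;
}

/* Rebuild the value of a normal number: (2^23 + fraction) * 2^(e - 127 - 23). */
float rebuild_normal(struct binary32_fields f)
{
    float significand = (float)(f.fraction | UINT32_C(0x800000)); /* implicit leading 1 */
    float magnitude = ldexpf(significand, (int)f.exponent - 127 - 23);
    return f.sign ? -magnitude : magnitude;
}
```

Incrementing the exponent member before calling rebuild_normal() has the same effect as the bi+(1<<23) trick in the question.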
I'm new to coding in C and I've been trying to wrap my head around unsigned integers. This is the code I have:
#include <stdio.h>
int main(void)
{
unsigned int hours;
do
{
printf("Number of hours you spend sleeping a day: ");
scanf(" %u", &hours);
}
while(hours < 0);
printf("\nYour number is %u", hours);
}
However, when I run the code and use (-1) it does not ask the question again like it should and prints out (Your number is 4294967295) instead. If I change unsigned int to a normal int, the code works fine. Is there a way I can change my code to make the unsigned int work?
Appreciate any help!
Is there a way I can change my code to make the unsigned int work?
Various approaches possible.
Read as int and then convert to unsigned.
Given "Number of hours you spend sleeping a day: " implies a small legitimate range about 0 to 24, read as int and convert.
int input = -1; // stays out of range if scanf() fails
do {
    puts("Number of hours you spend sleeping a day:");
    if (scanf("%d", &input) != 1) {
        Handle_non_text_input(); // TBD code for non-numeric input like "abc"
    }
} while (input < 0 || input > 24);
unsigned hours = input;
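Another possible approach is to read a whole line (for example with fgets()) and validate it with strtol(), which makes non-numeric and out-of-range input easy to reject. The sketch below shows only the validation step; parse_hours is a made-up name and the 0 to 24 range is taken from the prompt text.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse text as a number of hours in 0..24.
   Returns 1 and stores the value on success, 0 on any invalid input. */
int parse_hours(const char *text, unsigned *hours)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || errno == ERANGE)
        return 0;                         /* no digits, or out of long range */
    while (isspace((unsigned char)*end))
        end++;                            /* allow a trailing newline */
    if (*end != '\0')
        return 0;                         /* trailing junk such as "8abc" */
    if (value < 0 || value > 24)
        return 0;                         /* negative or implausible */
    *hours = (unsigned)value;
    return 1;
}
```

Because the value is checked while it is still signed, "-1" is rejected instead of wrapping around to 4294967295.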
An unsigned int cannot hold negative numbers. It is useful since it can use all 32 bits for the magnitude (its maximum is about twice that of a regular int), but it cannot hold negative numbers. So when you try to read a negative value into your unsigned int, it is read as a positive number. Although both int and unsigned int are 32-bit numbers, they are interpreted very differently.
I would try the next test:
do {
    printf("Number of hours you spend sleeping a day: ");
    scanf("%u", &hours);
} while (hours > 24);
Why should it work?
An unsigned int in C is a binary number, typically with 32 bits. That means its max value is 2^32 - 1.
Note that:
2^32 - 1 == 4294967295. That is no coincidence. Negative ints are usually represented using the "Two's complement" method.
A word about that method:
When I use a regular int, its most significant bit is reserved for the sign: 1 if negative, 0 if positive. A positive int then holds a 0 in its most significant bit, and 1's and 0's in the remaining positions in the ordinary binary manner.
Negative ints, are represented differently:
Suppose K is a positive number, represented by N bits.
The number (-K) is represented using 1 in the most significant bit, and the POSITIVE NUMBER: (2^(N-1) - K) occupying the N-1 least significant bits.
Example:
Suppose N = 4, K = 7. Binary representation for 7 using 4 bit:
7 = 0111 (The most significant bit is reserved for sign, remember?)
-7 , on the other hand:
-7 = concat(1, 2^(4-1) - 7) == 1001
Another example:
1 = 0001, -1 = 1111.
Note that if we use 32 bits, -1 is 1...1 (altogether we have 32 1's). This is exactly the binary representation of the unsigned int 4294967295. When you use unsigned int, you instruct the compiler to refer to -1 as a positive number. This is where your unexpected "error" comes from.
Now, if you use while(hours > 24), you rule out most of the illegal input. I am not sure, though, that you rule out all of it: there might be a negative number that ends up as a non-negative number in the range [0:24] once the sign is ignored and the most significant bit is treated as 'just another bit'.
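The 4-bit examples above can be checked with a few lines of C. The helper name below is made up; the sketch relies on the fact that unsigned arithmetic in C wraps around modulo 2^N, which gives the same bit patterns as the two's complement rule described above.

```c
#include <limits.h>

/* Bit pattern that represents -k in an nbits-wide two's complement integer.
   Unsigned subtraction wraps modulo 2^N, so 0 - k yields exactly that
   pattern; the mask keeps only the low nbits bits. */
unsigned twos_complement_pattern(unsigned k, unsigned nbits)
{
    unsigned mask = (nbits >= sizeof(unsigned) * CHAR_BIT)
                        ? ~0u
                        : ((1u << nbits) - 1u);
    return (0u - k) & mask;
}
```

With nbits equal to the full width of unsigned int, the pattern for -1 is UINT_MAX, which is the 4294967295 printed by the program in the question when unsigned int has 32 bits.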
For the following program.
#include <stdio.h>
int main()
{
unsigned int a = 10;
unsigned int b = 20;
unsigned int c = 30;
float d = -((a*b)*(c/3));
printf("d = %f\n", d);
return 0;
}
It is very strange that output is
d = 4294965248.000000
When I change the magic number 3 in the expression to calculate d to 3.0, I got correct result:
d = 2000.000000
If I change the type of a, b, c to int, I also got correct result.
I guess this error occurred by the conversion from unsigned int to float, but I do not know details about how the strange result was created.
I think you realize that the negation is applied to an unsigned int before the assignment to float. If you run the code below, you will most likely get 4294965296.
#include <stdio.h>
int main()
{
unsigned int a = 10;
unsigned int b = 20;
unsigned int c = 30;
printf("%u", -((a*b)*(c/3)));
return 0;
}
The -2000 to the right of your equals sign is set up as a signed
integer (probably 32 bits in size) and will have the hexadecimal value
0xFFFFF830. The compiler generates code to move this signed integer
into your unsigned integer x which is also a 32 bit entity. The
compiler assumes you only have a positive value to the right of the
equals sign so it simply moves all 32 bits into x. x now has the
value 0xFFFFF830 which is 4294965296 if interpreted as a positive
number. But the printf format of %d says the 32 bits are to be
interpreted as a signed integer so you get -2000. If you had used
%u it would have printed as 4294965296.
#include <stdio.h>
#include <limits.h>
int main()
{
float d = 4294965296;
printf("d = %f\n\n", d);
return 0;
}
When you convert 4294965296 to float, the number is too long to fit into the fraction part, so some precision is lost. Because of that loss, you get 4294965248.000000, as I did.
The IEEE-754 floating-point standard is a standard for representing
and manipulating floating-point quantities that is followed by all
modern computer systems.
bit  31 30    23 22                    0
     S  EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
The bit numbers are counting from the least-significant bit. The first
bit is the sign (0 for positive, 1 for negative). The following
8 bits are the exponent in excess-127 binary notation; this
means that the binary pattern 01111111 = 127 represents an exponent
of 0, 10000000 = 128, represents 1, 01111110 = 126 represents
-1, and so forth. The mantissa fits in the remaining 23 bits, with
its leading 1 stripped off as described above. Source
As you can see, when converting 4294965296 to float, only the top 24 bits fit in the significand, so the low 8 bits (00110000) are lost.
111111111111111111111000 00110000 <-- 4294965296
111111111111111111111000 00000000 <-- 4294965248
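The size of the rounding step can also be computed rather than read off the bit pattern. The sketch below uses hypothetical helper names and assumes a 32-bit unsigned int and IEEE-754 binary32 floats; it derives the gap between neighbouring float values from the exponent reported by frexpf().

```c
#include <float.h>
#include <math.h>

/* Gap between neighbouring float values at the magnitude of x.
   Intended for finite, normal, nonzero x only. */
float float_spacing(float x)
{
    int e;
    frexpf(x, &e);                          /* x == m * 2^e, 0.5 <= |m| < 1 */
    return ldexpf(1.0f, e - FLT_MANT_DIG);  /* FLT_MANT_DIG is 24 for binary32 */
}

/* Convert an unsigned value to float and back to a wider integer,
   to see which value the float actually holds. */
unsigned long long through_float(unsigned u)
{
    return (unsigned long long)(float)u;
}
```

Just below 2^32 the spacing works out to 2^(32-24) = 256, so 4294965296 has to be rounded to a multiple of 256, and the nearest one is 4294965248.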
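This is because you use - on an unsigned int. Unary - on an unsigned value does not produce a negative number; it wraps around modulo 2^32 (equivalently, it inverts the bits and adds one). Let's print some unsigned integers: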
printf("Positive: %u\n", 2000);
printf("Negative: %u\n", -2000);
// Output:
// Positive: 2000
// Negative: 4294965296
Let's print the hex values:
printf("Positive: %x\n", 2000);
printf("Negative: %x\n", -2000);
// Output
// Positive: 7d0
// Negative: fffff830
As you can see, the negated value is the bit-inverted value plus one (~0x7d0 + 1 == 0xfffff830). So the problem comes from using - on unsigned int, not from casting unsigned int to float.
As others have said, the issue is that you are trying to negate an unsigned number. Most of the solutions already given have you do some form of casting to float such that the arithmetic is done on floating point types. An alternate solution would be to cast the results of your arithmetic to int and then negate, that way the arithmetic operations will be done on integral types, which may or may not be preferable, depending on your actual use-case:
#include <stdio.h>
int main(void)
{
unsigned int a = 10;
unsigned int b = 20;
unsigned int c = 30;
float d = -(int)((a*b)*(c/3));
printf("d = %f\n", d);
return 0;
}
Your whole calculation will be done unsigned so it is the same as
float d = -(2000u);
-2000 in unsigned int (assuming 32-bit int) is 4294965296
This gets written to your float d. But as float cannot store this exact number, it gets saved as 4294965248.
As a rule of thumb you can say that float has a precision of 7 significant base 10 digits.
What is calculated is 2^32 - 2000 and then floating point precision does the rest.
If you instead use 3.0 this changes the types in your calculation as follows
float d = -((a*b)*(c/3.0));
float d = -((unsigned*unsigned)*(unsigned/double));
float d = -((unsigned)*(double));
float d = -(double);
leaving you with the correct negative value.
you need to cast the ints to floats
float d = -((a*b)*(c/3));
to
float d = -(((float)a*(float)b)*((float)c/3.0));
-((a*b)*(c/3)); is all performed in unsigned integer arithmetic, including the unary negation. Unary negation is well-defined for an unsigned type: mathematically the result is reduced modulo 2^N, where N is the number of bits in unsigned int. When you assign that large number to the float, you encounter some loss of precision; at this magnitude adjacent float values are 256 apart, so the result is the nearest multiple of 256, 4294965248.
If you change 3 to 3.0, then c / 3.0 is a double type, and the result of a * b is therefore converted to a double before being multiplied. The negation then happens in double, giving -2000.0, which is converted to float on assignment without any loss because -2000 is exactly representable.
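The three variants discussed in this thread can be put side by side. The function names below are made up for the comparison, and the values in the comments assume a 32-bit unsigned int and an IEEE-754 binary32 float.

```c
/* Everything, including the negation, happens in unsigned arithmetic,
   so the result wraps to 2^32 - 2000 before it is converted to float. */
float negate_as_unsigned(unsigned a, unsigned b, unsigned c)
{
    unsigned product = (a * b) * (c / 3);
    unsigned wrapped = -product;          /* modulo 2^32 */
    return (float)wrapped;                /* rounded to the nearest float */
}

/* The product is converted to int first, so the negation is signed. */
float negate_as_int(unsigned a, unsigned b, unsigned c)
{
    int product = (int)((a * b) * (c / 3));
    return (float)-product;
}

/* Dividing by 3.0 promotes the rest of the expression to double. */
float negate_as_double(unsigned a, unsigned b, unsigned c)
{
    double product = (a * b) * (c / 3.0);
    return (float)-product;
}
```

Called with 10, 20 and 30, the first function should give 4294965248 and the other two -2000. The int variant is only appropriate when the product is known to fit in an int.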
So, I am trying to program a function which prints a given float number (n) in its (mantissa * 2^exponent) format. I was able to get the sign and the exponent, but not the mantissa (whichever the number is, mantissa is always equal to 0.000000). What I have is:
unsigned int num = *(unsigned*)&n;
unsigned int m = num & 0x007fffff;
mantissa = *(float*)&m;
Any ideas of what the problem might be?
The C library includes a function that does this exact task, frexp:
int expon;
float mant = frexpf(n, &expon);
printf("%g = %g * 2^%d\n", n, mant, expon);
Another way to do it is with log2f and exp2f:
if (n == 0) {
mant = 0;
expon = 0;
} else {
expon = floorf(log2f(fabsf(n)));
mant = n * exp2f(-expon);
}
These two techniques are likely to give different results for the same input. For instance, on my computer the frexpf technique describes 4 as 0.5 × 2^3 but the log2f technique describes 4 as 1 × 2^2. Both are correct, mathematically speaking. Also, frexp will give you the exact bits of the mantissa, whereas log2f and exp2f will probably round off the last bit or two.
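If the 1 × 2^2 convention is preferred but the exactness of frexpf is wanted, one option is to rescale its result. This is a small sketch with an invented function name; doubling the mantissa is exact, so no bits are lost.

```c
#include <math.h>

/* Split n into mant * 2^expon with 1 <= |mant| < 2 (or mant == 0).
   frexpf() returns 0.5 <= |mant| < 1, so shift by one power of two. */
void split_one_to_two(float n, float *mant, int *expon)
{
    *mant = frexpf(n, expon);
    if (*mant != 0.0f && isfinite(*mant)) {
        *mant *= 2.0f;    /* exact: only the exponent changes */
        *expon -= 1;
    }
}
```

For n = 4 this should yield mant = 1 and expon = 2, whereas frexpf alone reports 0.5 and 3.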
You should know that *(unsigned *)&n and *(float *)&m violate the rule against "type punning" and have undefined behavior. If you want to get the integer with the same bit representation as a float, or vice versa, use a union:
union { uint32_t i; float f; } u;
u.f = n;
num = u.i;
(Note: This use of unions is well-defined in C since roughly 2003, but, due to the C++ committee's long-standing habit of not paying sufficient attention to changes going into C, it is not officially well-defined in C++.)
You should also know IEEE floating-point numbers use "biased" exponents. When you initialize a float variable's mantissa field but leave its exponent field at zero, that gives you the representation of a number with a large negative exponent: in other words, a number so small that printf("%f", n) will print it as zero. Whenever printf("%f", variable) prints zero, change %f to %g or %a and rerun the program before assuming that variable actually is zero.
You are stripping off the bits of the exponent, leaving 0. An exponent of 0 is special, it means the number is denormalized and is quite small, at the very bottom of the range of representable numbers. I think you'd find if you looked closely that your result isn't quite exactly zero, just so small that you have trouble telling the difference.
To get a reasonable number for the mantissa, you need to put an appropriate exponent back in. If you want a mantissa in the range of 1.0 to 2.0, you need an exponent of 0, but adding the bias means you really need an exponent of 127.
unsigned int m = (num & 0x007fffff) | (127 << 23);
mantissa = *(float*)&m;
If you'd rather have a fully integer mantissa you need an exponent of 23, biased it becomes 150.
unsigned int m = (num & 0x007fffff) | ((23+127) << 23);
mantissa = *(float*)&m;
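The same two masks can be applied without the pointer casts by copying the bits with memcpy. The following sketch assumes IEEE-754 binary32 and normal input values; the function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Replace the sign and exponent fields of n with the given biased exponent,
   keeping the 23 fraction bits. Helper for the two functions below. */
static float with_biased_exponent(float n, uint32_t biased)
{
    uint32_t bits;
    float result;

    memcpy(&bits, &n, sizeof bits);
    bits = (bits & UINT32_C(0x007FFFFF)) | (biased << 23);
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Mantissa as a value in [1.0, 2.0): exponent 0, biased 127. */
float mantissa_one_to_two(float n)
{
    return with_biased_exponent(n, 127);
}

/* Mantissa as a whole number in [2^23, 2^24): exponent 23, biased 150. */
float mantissa_as_integer(float n)
{
    return with_biased_exponent(n, 23 + 127);
}
```

Both functions drop the sign bit, like the masks above, so the result is always positive.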
In addition to zwol's remarks: if you want to do it yourself you have to acquire some knowledge about the innards of an IEEE-754 float. Once you have done so you can write something like
#include <stdlib.h>
#include <stdio.h>
#include <math.h> // for testing only
typedef union {
float value;
unsigned int bits; // assuming 32 bit large ints (better: uint32_t)
} ieee_754_float;
// clang -g3 -O3 -W -Wall -Wextra -Wpedantic -Weverything -std=c11 -o testthewest testthewest.c -lm
int main(int argc, char **argv)
{
unsigned int m, num;
int exp; // the exponent can be negative
float n, mantissa;
ieee_754_float uf;
// neither checks nor balances included!
if (argc == 2) {
n = atof(argv[1]);
} else {
exit(EXIT_FAILURE);
}
uf.value = n;
num = uf.bits;
m = num & 0x807fffff; // keep sign bit and mantissa (i.e.: clear the exponent field)
num = num & 0x7fffffff; // full number without sign bit
exp = (num >> 23) - 126; // extract exponent and subtract bias
m |= 0x3f000000; // normalize mantissa (add bias)
uf.bits = m;
mantissa = uf.value;
printf("n = %g, mantissa = %g, exp = %d, check %g\n", n, mantissa, exp, mantissa * powf(2, exp));
exit(EXIT_SUCCESS);
}
Note: the code above is one of the quick&dirty(tm) species and is not meant for production. It also lacks handling for subnormal (denormal) numbers, a thing you must include. Hint: multiply the mantissa with a large power of two (e.g.: 2^25 or in that ballpark) and adjust the exponent accordingly (if you took the value of my example subtract 25).
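The hint about subnormal numbers could be implemented roughly as follows. This is only a sketch: the function name is invented, it assumes IEEE-754 binary32, and it does not handle NaN or infinities. Subnormal inputs are first multiplied by 2^25 so that they become normal, and the 25 is subtracted from the exponent afterwards.

```c
#include <stdint.h>
#include <string.h>

/* Split n into mantissa * 2^exponent with 0.5 <= |mantissa| < 1,
   including subnormal inputs. NaN and infinities are not handled. */
void decompose_float(float n, float *mantissa, int *exponent)
{
    uint32_t bits;
    int adjust = 0;

    if (n == 0.0f) {
        *mantissa = n;
        *exponent = 0;
        return;
    }
    memcpy(&bits, &n, sizeof bits);
    if (((bits >> 23) & 0xFFu) == 0) {   /* subnormal: exponent field is 0 */
        n *= 33554432.0f;                /* 2^25 makes the value normal */
        adjust = 25;
        memcpy(&bits, &n, sizeof bits);
    }
    *exponent = (int)((bits >> 23) & 0xFFu) - 126 - adjust;
    bits = (bits & UINT32_C(0x807FFFFF)) | UINT32_C(0x3F000000); /* exponent 126 */
    memcpy(mantissa, &bits, sizeof bits);
}
```

For finite nonzero inputs the results should agree with frexpf(), which is a convenient way to test an implementation like this.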