Is a conversion from an int to a float always possible in C without the float becoming one of the special values like +Inf or -Inf?
AFAIK there is no upper limit on the range of int.
I think a 128-bit int would cause an issue for a platform with an IEEE754 float, as that has an upper value of around the 127th power of 2.
Short answer to your question: no, it is not always possible.
But it is worthwhile to go into a little more detail. The following paragraph shows what the standard says about integer-to-floating-point conversions (online C11 standard draft):
6.3.1.4 Real floating and integer
2) When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined. ...
So many integer values may be converted exactly. Some integer values may lose precision, yet a conversion is at least possible. For some values, however, the behaviour may be undefined (if, for example, an integer value cannot be represented even with the maximum exponent of the float type). But I cannot think of a case where this will actually happen.
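For illustration, here is a minimal sketch, assuming a 32-bit int and an IEEE-754 binary32 float, of a conversion that stays in range yet loses precision:
#include <stdio.h>

int main(void)
{
    int i = 2147483647;   /* INT_MAX on typical 32-bit-int platforms */
    float f = i;          /* in range, but 31 significant bits exceed the 24-bit significand */
    printf("%.1f\n", f);  /* typically prints 2147483648.0 */
    return 0;
}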
Is it always possible to convert an int to a float?
Reasonably - yes. An int will always convert to a finite float. The conversion may lose some precision for large int values.
Yet for the pedantic, an odd compiler could have trouble.
C allows for excessively wide int, not just 16-, 32- or 64-bit ones, and float could have a limited range, as small as 1e37.
It is not the upper range of int, or INT_MAX, that should be of concern. It is the lower end: INT_MIN, which often has a magnitude one greater than INT_MAX.
A 124-bit int's minimum value could be about -1.06e37, so that does exceed the minimal float range.
With the common binary32 float, an int would need to be more than 128 bits to cause a float infinity.
So what test is needed to detect this rare situation?
Form an exact power-of-2 limit and perform careful math to avoid overflow or imprecision.
#if -INT_MAX == INT_MIN
// Rare non-2's-complement machine: the int range is [-INT_MAX, INT_MAX].
// Compare halves so INT_MAX + 1 is never formed, which would overflow int.
#define INT_MAX_P1_HALF (INT_MAX/2 + 1)
_Static_assert(FLT_MAX/2 >= INT_MAX_P1_HALF, "non-2's comp.`int` range exceeds `float`");
#else
// 2's complement: INT_MIN has the greater magnitude, so that is the value to test.
_Static_assert(-FLT_MAX <= INT_MIN, "2's complement `int` range exceeds `float`");
#endif
The standard only requires floating point representations to include a finite number as large as 10^37 (§5.2.4.2.2/12) and does not put any limit on the maximum size of an integer. So if your implementation has 128-bit integers (or even 124-bit integers), it is possible for an integer-to-float conversion to exceed the range of finite representable floating point numbers.
No, it is not always possible to convert an int exactly to a float, due to how floats work. 32-bit floats greater than 16777216 (or less than -16777216) need to be even; those greater than 33554432 (or less than -33554432) need to be evenly divisible by 4; those greater than 67108864 (or less than -67108864) need to be evenly divisible by 8; etc. The IEEE-754 float standard defines round-to-nearest-even as the default mode, but other modes exist depending upon the implementation.
Also, the largest 128-bit int, 2^128 − 1, is greater than the largest 32-bit float, 2^127 × 1.11111111111111111111111 (binary) = 2^127 × (2 − 2^−23) = 2^128 − 2^104.
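A small sketch of that divisibility effect, assuming IEEE-754 binary32 float; between 2^25 and 2^26 the spacing between representable values is 4:
#include <stdio.h>

int main(void)
{
    printf("%.1f\n", (float)33554433);  /* 2^25 + 1 -> 33554432.0 */
    printf("%.1f\n", (float)33554435);  /* 2^25 + 3 -> 33554436.0 */
    return 0;
}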
I am trying to convert int32 values to float, and when I try to convert values above 0x0FFFFF the last decimal digit is always rounded to the nearest value. I know that when a value does not fit in the destination float it will be rounded, but I need to know the limit value at which this happens.
e.g. 111111111 (0x69F6BC7) is printed as 111111112.0 .
The maximum integer value of a float significand is FLT_RADIX/FLT_EPSILON - 1. By “integer value” of a significand, I mean the value when it is scaled so that its lowest bit represents a value of 1.
The value FLT_RADIX/FLT_EPSILON is also representable in float, since it is a power of the radix. FLT_RADIX/FLT_EPSILON + 1 is not representable in float, so converting an integer to float might result in rounding if the integer exceeds FLT_RADIX/FLT_EPSILON in magnitude.
If it is known that INT_MAX exceeds FLT_RADIX/FLT_EPSILON, you can test this for a non-negative int x with (int) (FLT_RADIX/FLT_EPSILON) < x. If it is not known that FLT_RADIX/FLT_EPSILON can be converted to int successfully, more complicated tests may be needed.
Very commonly, C implementations use the IEEE-754 binary32 format, also known as “single precision,” for float. In this format, FLT_RADIX/FLT_EPSILON is 2^24 = 16,777,216.
These symbols are defined in <float.h>. For double or long double, replace FLT_EPSILON with DBL_EPSILON or LDBL_EPSILON. FLT_RADIX remains unchanged since it is the same for all formats.
Theoretically, a perverse floating-point format might have an abnormally small exponent range that makes FLT_RADIX/FLT_EPSILON - 1 not representable because the significand cannot be scaled high enough. This can be disregarded in practice.
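The test above can be wrapped into a small helper. This is a minimal sketch with a hypothetical function name, assuming FLT_RADIX/FLT_EPSILON fits in an int (true for binary32 float with a 32-bit int):
#include <float.h>
#include <stdio.h>

/* Returns 1 when a non-negative int x is guaranteed to convert to
   float exactly; larger values may be rounded. Assumes the quotient
   FLT_RADIX/FLT_EPSILON is itself representable as an int. */
static int converts_exactly(int x)
{
    return x <= (int)(FLT_RADIX / FLT_EPSILON);
}

int main(void)
{
    printf("%d\n", converts_exactly(16777216)); /* 1: 2^24 is a power of the radix */
    printf("%d\n", converts_exactly(16777217)); /* 0: 2^24 + 1 must round */
    return 0;
}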
Suppose I have
int i=25;
double j=(double)i;
Is there a chance that j will have a value like 24.9999999... (up to the allowed precision) or 25.0000000...01 (zeros up to the allowed precision minus one, and then a 1)? I remember reading such stuff somewhere but am not able to recall it properly.
In other words:
Is there a case when an integer loses its precision when cast to double?
For small numbers like 25, you are good. For ints of very large magnitude on an architecture where int is 64 bits or wider (i.e. a value not representable in 53 bits), you will lose precision.
A double-precision floating-point number has 53 bits of significand precision, of which the most significant bit is (implicitly) 1 for normal numbers.
On platforms where the floating-point representation is not IEEE-754, the answer may be a little different. For more details you can refer to section 5.2.4.2.2 of the C99/C11 specs.
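A quick sketch, assuming IEEE-754 binary64 double and a 64-bit integer type, showing exactly where the round trip breaks:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int64_t exact   = (int64_t)1 << 53;         /* 2^53: a power of two, exactly representable */
    int64_t inexact = ((int64_t)1 << 53) + 1;   /* 2^53 + 1: needs 54 significant bits */

    printf("%.1f\n", (double)exact);    /* 9007199254740992.0 */
    printf("%.1f\n", (double)inexact);  /* also 9007199254740992.0: rounded */
    return 0;
}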
An IEEE-754 double has a significand precision of 53 bits. This means it can store all signed integers within the range -2^53 to 2^53.
Because int typically has 32 bits on most compilers/architectures, double will usually be able to handle any int exactly.
@Mohit Jain's answer is good for practical coding.
By the C spec, DBL_DIG (or FLT_RADIX together with DBL_MANT_DIG) and INT_MAX/INT_MIN are the important values.
DBL_DIG is the maximum number of decimal digits a number can have such that conversion to double and back will certainly preserve its value. It is at least 10. So a whole number like 9,999,999,999 can certainly convert to a double and back without losing precision. Possibly larger values can successfully round-trip too.
The real round-trip problems begin with integer values exceeding ±power(FLT_RADIX, DBL_MANT_DIG). FLT_RADIX is the floating-point base (which is overwhelmingly 2) and DBL_MANT_DIG is the "number of base-FLT_RADIX digits in the floating-point significand", such as 53 with IEEE-754 binary64.
Of course an int has the range [INT_MIN ... INT_MAX]. That range must be at least [-32767 ... +32767].
When, mathematically, power(FLT_RADIX, DBL_MANT_DIG) >= INT_MAX, there are no conversion problems. This applies to all conforming C compilers.
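As a compile-time sketch of that check, assuming a binary FLT_RADIX and a DBL_MANT_DIG below 64 so the shift stays in range:
#include <float.h>
#include <limits.h>

_Static_assert(FLT_RADIX == 2, "sketch assumes a binary floating-point radix");
_Static_assert(DBL_MANT_DIG < 64, "sketch assumes the shift below stays in range");
/* INT_MAX >> DBL_MANT_DIG == 0 means INT_MAX < 2^DBL_MANT_DIG,
   so every int value converts to double without rounding. */
_Static_assert(((unsigned long long)INT_MAX >> DBL_MANT_DIG) == 0,
               "int values may lose precision when converted to double");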
This question might be very basic, but I post it here only after days of googling, and for my proper basic understanding of signed integers in C.
Actually some say signed int has range
-32767 to 32767 and others say it has range
-32768 to 32767
Let us have int a = 5 (signed; let us consider just 1 byte).
* In the 1st representation, a = 5 is represented as 00000101 as a positive number, and a = -5 is represented as 10000101 (so range -32767 to 32767 justified).
(Here, when the msb/sign bit is 0/1, the number will be positive/negative, and the rest (the magnitude bits) are unchanged.)
* In the 2nd representation, a = 5 is represented as 00000101 as a positive number, and a = -5 is represented as 11111011.
(Here the msb is weighted as -128 and the rest of the bits are combined to obtain -5.) (So range -32768 to 32767 justified.)
So I am confused between these two things. My doubt is: what is the actual range of signed int in C, 1) or 2)?
It depends on your environment. Typically int can store -2147483648 to 2147483647 if it is 32 bits long and two's complement is used, but the C specification says that int must be able to store at least -32767 to 32767.
Quote from N1256 5.2.4.2.1 Sizes of integer types <limits.h>
Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
— minimum value for an object of type int
INT_MIN -32767 // −(2^15 − 1)
— maximum value for an object of type int
INT_MAX +32767 // 2^15 − 1
Today, signed ints are usually done in two's complement notation.
The highest bit is the "sign bit", it is set for all negative numbers.
This means you have fifteen bits to represent different values.
With the highest bit unset, you can (with 16 bits total) represent the values 0..32767.
With the highest bit set, and because you already have a representation for zero, you can represent the values -1..-32768.
This is, however, implementation-defined, other representations do exist as well. The actual range limits for signed integers on your platform / for your compiler are the ones found in your environment's <limits.h>. That is the only definite authority.
On today's desktop systems, an int is usually 32 or 64 bits wide, for a correspondingly much larger range than the 16-bit 32767 / 32768 you are talking of. So either those people are talking about really old platforms, really old knowledge, embedded systems, or the minimum guaranteed range -- the standard states that INT_MIN must be -32767 or less, and INT_MAX must be +32767 or more, the lowest common denominator.
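A quick way to get the definite answer on your own platform is to print the macros from <limits.h>; a trivial sketch:
#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("INT_MIN = %d\n", INT_MIN);  /* the authoritative range for this implementation */
    printf("INT_MAX = %d\n", INT_MAX);
    return 0;
}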
My doubt is what is the actual range of signed int in c ,1) [-32767 to 32767] or 2) [-32768 to 32767]?
The whole point of C and its advantage of high portability to old and new platforms is that code should not care.
C defines the range of int with 2 macros: INT_MIN and INT_MAX. The C spec specifies:
INT_MIN is -32,767 or less.
INT_MAX is +32,767 or more.
If code needs a 16-bit 2's complement type, use int16_t. If code needs a 32-bit or wider type, use long or int_least32_t, etc. Do not code assuming int is something that it is not defined to be.
The value 32767 is the maximum positive value you can represent in a signed 16-bit integer. The corresponding C type is short.
The int type is represented with at least the same number of bytes as short and at most the same number of bytes as long. On 16-bit processors, the size of int is 2 bytes (the same as short). On 32-bit and higher architectures, the size of int is typically 4 bytes.
No matter the architecture, the minimum value of int is INT_MIN and the maximum value of int is INT_MAX.
Similarly, there are constants to get the minimum and maximum values for short (SHRT_MIN and SHRT_MAX), long, char, etc. You don't need to use hardcoded constants or guess what the minimum value for int is on your system.
Representation #1 is named "sign and magnitude representation". It is a theoretical model that uses the most significant bit to store the sign and the rest of the bits to store the absolute value of the number. It was used by some early computers, probably because it seemed a natural mapping of the way numbers are written in mathematics. However, it is not natural for binary computers.
The representation #2 is named two's complement. The two's-complement system has the advantage that the fundamental arithmetic operations of addition, subtraction, and multiplication are identical to those for unsigned binary numbers (as long as the inputs are represented in the same number of bits and any overflow beyond those bits is discarded from the result). This is why it is the preferred encoding nowadays.
The C standard specifies the lowest limits for integer values. As it is written in the Standard (5.2.4.2.1 Sizes of integer types):
...Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
For objects of type int these lowest limits are
— minimum value for an object of type int
INT_MIN -32767 // −(2^15 − 1)
— maximum value for an object of type int
INT_MAX +32767 // 2^15 − 1
For the two's complement representation of integers, the number of positive values is one less than the number of negative values. So if only two bytes are used for representations of objects of type int, then INT_MIN will be equal to -32768.
Take into account that 32768 in magnitude is greater than the value used in the Standard. So it satisfies the Standard requirement.
On the other hand, for the "sign and magnitude" representation, the limits (when 2 bytes are used) will be the same as shown in the Standard, that is, -32767 to 32767.
So the actual limits used in the implementation depend on the width of integers and their representation.
As we all know, when an integer variable exceeds its range, it starts from the other end, that is, from negative numbers. For example
int a=2147483648;
printf("%d",a);
OUTPUT:
-2147483648 (as I was expecting)
Now I tried the same for floating points.
For example
float a=3.4e39;//as largest float is 3.4e38
printf("%f",a);
OUTPUT:
1.#INF00 (I was expecting some negative float value)
I didn't get the above output exactly but I know It represents positive infinity.
So my question is simply: why does it not start from the other end (at negative values, like integers)?
Floating point numbers are stored in a different format than integer numbers, and don't follow the same over-/under-flowing mechanics.
More specifically, the binary bit pattern for 2147483648 is 10000000000000000000000000000000 (a 1 followed by 31 zeros), which in a two's complement system (like the one used on almost all modern computers) is the same as -2147483648.
Most computers today use the IEEE754 format for floating point values, and those are handled quite differently from plain integers.
In IEEE-754, the maximum finite float (binary32) value is below the double value 3.4e39.
IEEE-754 says (for default rounding-direction attribute roundTiesToEven):
(IEEE-754:2008, 4.3.1 Rounding-direction attributes to nearest) "In the following two rounding-direction attributes, an infinitely precise result with magnitude at least b^emax (b − ½ b^(1−p)) shall round to ∞ with no change in sign; here emax and p are determined by the destination format (see 3.3)"
So in this declaration:
float a=3.4e39;
the conversion yields a positive infinity.
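A small sketch to observe that result, assuming a C99 <math.h> with isinf:
#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = 3.4e39;           /* double constant above FLT_MAX */
    printf("%d\n", isinf(a));   /* nonzero: the conversion overflowed to +infinity */
    printf("%f\n", a);          /* prints "inf" on most platforms ("1.#INF00" on old MSVC) */
    return 0;
}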
Under IEEE floating point, it's impossible for arithmetic to overflow because the representable range is [-INF,INF] (including the endpoints). As usual, floating point is subject to rounding when the exact value is not representable, and in your case, rounding yields INF.
Other answers have looked at floating point. This answer is about why signed integer values traditionally wrap around. It is not because that is particularly nice behavior. It is because that is what is expected because it is the way it has been done for a long time.
Especially in early hardware, with either discrete logic or very limited chip space, there was a major advantage to using the same adder for signed and unsigned integer addition and subtraction.
Floating point arithmetic was done in software except on special "scientific" computers that cost extra. Floating point numbers are always signed, and, as has been pointed out in other answers, have their own format. There is no signed/unsigned hardware sharing issue.
Common hardware for signed and unsigned integers can be achieved by using 2's complement representation for signed integer types.
What follows is based on 8 bit integers, with each bit pattern represented as 2 hexadecimal digits. Other widths work the same way.
00 through 7f have the same meaning in unsigned and 2's complement, 0 through 127 in that order, the intersection of the two ranges. 80 through ff represent 128 through 255, in that order, for unsigned integers, but represent negative numbers for signed. To make addition the same for both, 80 represents -128, and ff represents -1.
Now see what happens if you add 1 to 7f. For unsigned, it has to increment from 127 to 128. That means the resulting bit pattern is 80, which is also the most negative signed value. The price of sharing an adder is wrap-around at one point in the range.
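A tiny sketch of that shared bit pattern; note the conversion to int8_t is implementation-defined in C, but it yields the expected wrap on two's complement machines:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t pattern = 0x7f + 0x01;                   /* the shared adder yields bit pattern 0x80 */
    printf("unsigned: %u\n", (unsigned)pattern);     /* 128 */
    printf("signed:   %d\n", (int)(int8_t)pattern);  /* -128 on two's complement machines */
    return 0;
}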
What do arithmetic underflow and overflow mean in C programming?
Overflow
From http://en.wikipedia.org/wiki/Arithmetic_overflow:
the condition that occurs when a calculation produces a result that is greater in magnitude than that which a given register or storage location can store or represent.
So, for instance:
uint32_t x = 1UL << 31;
x *= 2; // Overflow!
Note that, as @R mentions in a comment below, the C standard says:
A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
Of course, this is a fairly idiosyncratic definition of "overflow". Most people would refer to modulo reduction (i.e. wrap-around) as "overflow".
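A short sketch of that well-defined wrap-around:
#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned int x = UINT_MAX;
    x += 1;                 /* well-defined: reduced modulo UINT_MAX + 1 */
    printf("%u\n", x);      /* 0 */
    return 0;
}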
Underflow
From http://en.wikipedia.org/wiki/Arithmetic_underflow:
the condition in a computer program that can occur when the true result of a floating point operation is smaller in magnitude (that is, closer to zero) than the smallest value representable as a normal floating point number in the target datatype.
So, for instance:
float x = 1e-30;
x /= 1e20; // Underflow!
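A sketch of what that example actually produces, assuming IEEE-754 binary32 float, where even the smallest subnormal is about 1.4e-45:
#include <stdio.h>
#include <float.h>

int main(void)
{
    float x = 1e-30f;
    x /= 1e20f;                         /* exact quotient 1e-50 is below every subnormal */
    printf("%g\n", x);                  /* 0: the result underflowed all the way to zero */
    printf("FLT_MIN = %g\n", FLT_MIN);  /* smallest normal float, about 1.17549e-38 */
    return 0;
}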
Computers use only 0 and 1 to represent data, so the range of values that can be represented is limited. Many computers use 32 bits to store integers, so the largest unsigned integer that can be stored in this case is 2^32 − 1 = 4294967295. For signed integers, the first bit is used to represent the sign, so the largest value is 2^31 − 1 = 2147483647.
The situation where an integer outside the allowed range requires more bits than can be stored is called an overflow.
Similarly, with real numbers, an exponent that is too small to be stored causes an underflow.
int, the most common data type in C, is typically a 32-bit data type. This means that each int is given 32 bits in memory. If I had the variable
int a = 2;
that would actually be represented in memory as a 32-bit binary number:
00000000000000000000000000000010.
If you have two binary numbers such as
10000000000000000000000000000000
and
10000000000000000000000000000000,
their sum would be 100000000000000000000000000000000, which is 33 bits long. However, the computer keeps only the 32 least significant bits, which are all 0. The sum is greater than what can be stored in 32 bits; this is an overflow.
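That 33-bit sum can be observed with unsigned arithmetic, where discarding the extra bit is well-defined; a sketch assuming a 32-bit unsigned int:
#include <stdio.h>

int main(void)
{
    unsigned int a = 0x80000000u;  /* 1 followed by 31 zeros */
    unsigned int sum = a + a;      /* true sum 0x100000000 needs 33 bits */
    printf("%u\n", sum);           /* 0: only the 32 least significant bits remain */
    return 0;
}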
An underflow is basically the same thing happening in the opposite direction. The floating-point format commonly used in C allows for 23 bits after the binary point of the significand; if a number needs precision beyond that point, those bits can't be stored. This results in an underflow and/or loss of precision.
Underflow depends exclusively upon the given algorithm and the given input data, and hence the programmer has no direct control over it. Overflow, on the other hand, depends upon the arbitrary choice of the programmer for the amount of memory space reserved for each stack, and this choice does influence the number of times overflow may occur.