In C programming, I find a weird problem, which counters my intuition. When I declare a integer as the INT_MAX (2147483647, defined in the limits.h) and implicitly convert it to a float value, it works fine, i.e., the float value is same with the maximum integer. And then, I convert the float back to an integer, something interesting happens. The new integer becomes the minimum integer (-2147483648).
The source codes look as below:
int a = INT_MAX;
float b = a; // b is correct
int a_new = b; // a_new becomes INT_MIN
I am not sure what happens when the float number b is converted to the integer a_new. So, is there any reasonable solution to find the maximum value which can be switched forth and back between integer and float type?
PS: The value of INT_MAX - 100 works fine, but this is just an arbitrary workaround.
This answer assumes that float is an IEEE-754 single precision float encoded as 32-bits, and that an int is 32-bits. See this Wikipedia article for more information about IEEE-754.
Floating point numbers only have 24-bits of precision, compared with 32-bits for an int. Therefore int values from 0 to 16777215 have an exact representation as floating point numbers, but numbers greater than 16777215 do not necessarily have exact representations as floats. The following code demonstrates this fact (on systems that use IEEE-754).
for ( int a = 16777210; a < 16777224; a++ )
{
float b = a;
int c = b;
printf( "a=%d c=%d b=0x%08x\n", a, c, *((int*)&b) );
}
The expected output is
a=16777210 c=16777210 b=0x4b7ffffa
a=16777211 c=16777211 b=0x4b7ffffb
a=16777212 c=16777212 b=0x4b7ffffc
a=16777213 c=16777213 b=0x4b7ffffd
a=16777214 c=16777214 b=0x4b7ffffe
a=16777215 c=16777215 b=0x4b7fffff
a=16777216 c=16777216 b=0x4b800000
a=16777217 c=16777216 b=0x4b800000
a=16777218 c=16777218 b=0x4b800001
a=16777219 c=16777220 b=0x4b800002
a=16777220 c=16777220 b=0x4b800002
a=16777221 c=16777220 b=0x4b800002
a=16777222 c=16777222 b=0x4b800003
a=16777223 c=16777224 b=0x4b800004
Of interest here is that the float value 0x4b800002 is used to represent the three int values 16777219, 16777220, and 16777221, and thus converting 16777219 to a float and back to an int does not preserve the exact value of the int.
The two floating point values that are closest to INT_MAX are 2147483520 and 2147483648, which can be demonstrated with this code
for ( int a = 2147483520; a < 2147483647; a++ )
{
float b = a;
int c = b;
printf( "a=%d c=%d b=0x%08x\n", a, c, *((int*)&b) );
}
The interesting parts of the output are
a=2147483520 c=2147483520 b=0x4effffff
a=2147483521 c=2147483520 b=0x4effffff
...
a=2147483582 c=2147483520 b=0x4effffff
a=2147483583 c=2147483520 b=0x4effffff
a=2147483584 c=-2147483648 b=0x4f000000
a=2147483585 c=-2147483648 b=0x4f000000
...
a=2147483645 c=-2147483648 b=0x4f000000
a=2147483646 c=-2147483648 b=0x4f000000
Note that all 32-bit int values from 2147483584 to 2147483647 will be rounded up to a float value of 2147483648. The largest int value that will round down is 2147483583, which the same as (INT_MAX - 64) on a 32-bit system.
One might conclude therefore that numbers below (INT_MAX - 64) will safely convert from int to float and back to int. But that is only true on systems where the size of an int is 32-bits, and a float is encoded per IEEE-754.
Related
EDIT: After some discussion in the comments it came out that because of a luck of knowledge in how floating point numbers are implemented in C, I asked something different from what I meant to ask.
I wanted to use (do operations with) integers larger than those I can have with unsigned long long (that for me is 8 bytes), possibly without recurring to arrays or bigint libraries. Since my long double is 16 bytes, I thought it could've been possible by just switching type. It came out that even though it is possible to represent larger integers, you can't do operations -with these larger long double integers- without losing precision. So it's not possible to achieve what I wanted to do. Actually, as stated in the comments, it is not possible for me. But in general, wether it is possible or not depends on the floating point characteristics of your long double.
// end of EDIT
I am trying to understand what's the largest integer that I can store in a long double.
I know it depends on environment which the program is built in, but I don't know exactly how. I have a sizeof(long double) == 16 for what is worth.
Now in this answer they say that the the maximum value for a 64-bit double should be 2^53, which is around 9 x 10^15, and exactly 9007199254740992.
When I run the following program, it just works:
#include <stdio.h>
int main() {
long double d = 9007199254740992.0L, i;
printf("%Lf\n", d);
for(i = -3.0; i < 4.0; i++) {
printf("%.Lf) %.1Lf\n", i, d+i);
}
return 0;
}
It works even with 11119007199254740992.0L that is the same number with four 1s added at the start. But when I add one more 1, the first printf works as expected, while all the others show the same number of the first print.
So I tried to get the largest value of my long double with this program
#include <stdio.h>
#include <math.h>
int main() {
long double d = 11119007199254740992.0L, i;
for(i = 0.0L; d+i == d+i-1.0; i++) {
if( !fmodl(i, 10000.0L) ) printf("%Lf\n", i);
}
printf("%.Lf\n", i);
return 0;
}
But it prints 0.
(Edit: I just realized that I needed the condition != in the for)
Always in the same answer, they say that the largest possible value of a double is DBL_MAX or approximately 1.8 x 10^308.
I have no idea of what does it mean, but if I run
printf("%e\n", LDBL_MAX);
I get every time a different value that is always around 6.9 x 10^(-310).
(Edit: I should have used %Le, getting as output a value around 1.19 x 10^4932)
I took LDBL_MAX from here.
I also tried this one
printf("%d\n", LDBL_MAX_10_EXP);
That gives the value 4932 (which I also found in this C++ question).
Since we have 16 bytes for a long double, even if all of them were for the integer part of the type, we would be able to store numbers till 2^128, that is around 3.4 x 10^38. So I don't get what 308, -310 and 4932 are supposed to mean.
Is someone able to tell me how can I find out what's the largest integer that I can store as long double?
Inasmuch as you express in comments that you want to use long double as a substitute for long long to obtain increased range, I assume that you also require unit precision. Thus, you are asking for the largest number representable by the available number of mantissa digits (LDBL_MANT_DIG) in the radix of the floating-point representation (FLT_RADIX). In the very likely event that FLT_RADIX == 2, you can compute that value like so:
#include <float.h>
#include <math.h>
long double get_max_integer_equivalent() {
long double max_bit = ldexpl(1, LDBL_MANT_DIG - 1);
return max_bit + (max_bit - 1);
}
The ldexp family of functions scale floating-point values by powers of 2, analogous to what the bit-shift operators (<< and >>) do for integers, so the above is similar to
// not reliable for the purpose!
unsigned long long max_bit = 1ULL << (DBL_MANT_DIG - 1);
return max_bit + (max_bit - 1);
Inasmuch as you suppose that your long double provides more mantissa digits than your long long has value bits, however, you must assume that bit shifting would overflow.
There are, of course, much larger values that your long double can express, all of them integers. But they do not have unit precision, and thus the behavior of your long double will diverge from the expected behavior of integers when its values are larger. For example, if long double variable d contains a larger value then at least one of d + 1 == d and d - 1 == d will likely evaluate to true.
You can print the maximum value on your machine using limits.h, the value is ULLONG_MAX
In https://www.geeksforgeeks.org/climits-limits-h-cc/ is a C++ example.
The format specifier for printing unsigned long long with printf() is %llu for printing long double it is %Lf
printf("unsigned long long int: %llu ",(unsigned long long) ULLONG_MAX);
printf("long double: %Lf ",(long double) LDBL_MAX);
https://www.tutorialspoint.com/format-specifiers-in-c
Is also in Printing unsigned long long int Value Type Returns Strange Results
Assuming you mean "stored without loss of information", LDBL_MANT_DIG gives the number of bits used for the floating-point mantissa, so that's how many bits of an integer value that can be stored without loss of information.*
You'd need 128-bit integers to easily determine the maximum integer value that can be held in a 128-bit float, but this will at least emit the hex value (this assumes unsigned long long is 64 bits - you can use CHAR_BIT and sizeof( unsigned long long ) to get a portable answer):
#include <stdio.h>
#include <float.h>
#include <limits.h>
int main( int argc, char **argv )
{
int tooBig = 0;
unsigned long long shift = LDBL_MANT_DIG;
if ( shift >= 64 )
{
tooBig = 1;
shift -= 64;
}
unsigned long long max = ( 1ULL << shift ) - 1ULL;
printf( "Max integer value: 0x" );
// don't emit an extraneous zero if LDBL_MANT_DIG is
// exactly 64
if ( max )
{
printf( "%llx", max );
}
if ( tooBig )
{
printf( "%llx", ULLONG_MAX );
}
printf( "\n" );
return( 0 );
}
* - pedantically, it's the number of digits in FLT_RADIX base, but that base is almost certainly 2.
I cannot figure out how to convert the value of a referenced float pointer when it is referenced from an integer casted into a float pointer. I'm sorry if I'm wording this incorrectly. Here is an example of what I mean:
#include <stdio.h>
main() {
int i;
float *f;
i = 1092616192;
f = (float *)&i;
printf("i is %d and f is %f\n", i, *f);
}
the output for f is 10. How did I get that result?
Normally, the value of 1092616192 in hexadecimal is 0x41200000.
In floating-point, that will give you:
sign = positive (0b)
exponent = 130, 2^3 (10000010b)
significand = 2097152, 1.25 (01000000000000000000000b)
2^3*1.25
= 8 *1.25
= 10
To explain the exponent part uses an offset encoding, so you have to subtract 127 from it to get the real value. 130 - 127 = 3. And since this is a binary encoding, we use 2 as the base. 2 ^ 3 = 8.
To explain the significand part, you start with an invisible 'whole' value of 1. the uppermost (leftmost) bit is half of that, 0.5. The next bit is half of 0.5, 0.25. Because only the 0.25 bit and the default '1' bit is set, the significand represents 1 + 0.25 = 1.25.
What you are trying to do is called type-punning. It should be done via a union, or using memcpy() and is only meaningful on an architecture where sizeof(int) == sizeof(float) without padding bits. The result is highly dependent on the architecture: byte ordering and floating point representation will affect the reinterpreted value. The presence of padding bits would invoke undefined behavior as the representation of float 15.0 could be a trap value for type int.
Here is how you get the number corresponding to 15.0:
#include <stdio.h>
int main(void) {
union {
float f;
int i;
unsigned int u;
} u;
u.f = 15;
printf("re-interpreting the bits of float %.1f as int gives %d (%#x in hex)\n",
u.f, u.i, u.u);
return 0;
}
output on an Intel PC:
re-interpreting the bits of float 15.0 as int gives 1097859072 (0x41700000 in hex)
You are trying to predict the consequence of an undefined activity - it depends on a lot of random things, and on the hardware and OS you are using.
Basically, what you are doing is throwing a glass against the wall and getting a certain shard. Now you are asking how to get a differently formed shard. well, you need to throw the glass differently against the wall...
I'm learning c, and am confused as my code seems to evaluate ( 1e16 - 1 >= 1e16 ) as true when it should be false. My code is below, it returns
9999999999999999 INVALIDBIG\n
when I would expect it not to return anything. I thought any problems with large numbers could be avoided by using long long.
int main(void)
{
long long z;
z = 9999999999999999;
if ( z >= 1e16 || z < 0 )
{
printf("%lli INVALIDBIG\n",z);
}
}
1e16 is a double type literal value, and floats/doubles can be imprecise for decimal arithmetic/comparison (just one of many common examples: decimal 0.2). Its going to cast the long-long z upwards to double for the comparison, and I'm guessing the standard double representation can't store the precision needed (maybe someone else can demonstrate the binary mantissa/sign representations)
Try changing the 1e16 to (long double)1e16, it doesn't then print out your message. (update: or, as the other question-commenter added, change 1e16 to an integer literal)
The doubles and floats can hold limited number of digits. In your case the double numbers with values 9999999999999999 and 1e16 have identical 8 bytes of hex representation. You can check them byte by byte:
long long z = 9999999999999999;
double dz1 = z;
double dz2 = 1e16;
/* prints 0 */
printf("memcmp: %d\n", memcmp(&dz1, &dz2, sizeof(double)));
So, they are equal.
Smaller integers can be stored in double with perfect precision. For example, see Double-precision floating-point format or biggest integer that can be stored in a double
The maximum integer that can be converted to double exactly is 253 (9007199254740992).
I was trying out few examples on do's and dont's of typecasting. I could not understand why the following code snippets failed to output the correct result.
/* int to float */
#include<stdio.h>
int main(){
int i = 37;
float f = *(float*)&i;
printf("\n %f \n",f);
return 0;
}
This prints 0.000000
/* float to short */
#include<stdio.h>
int main(){
float f = 7.0;
short s = *(float*)&f;
printf("\n s: %d \n",s);
return 0;
}
This prints 7
/* From double to char */
#include<stdio.h>
int main(){
double d = 3.14;
char ch = *(char*)&d;
printf("\n ch : %c \n",ch);
return 0;
}
This prints garbage
/* From short to double */
#include<stdio.h>
int main(){
short s = 45;
double d = *(double*)&s;
printf("\n d : %f \n",d);
return 0;
}
This prints 0.000000
Why does the cast from float to int give the correct result and all the other conversions give wrong results when type is cast explicitly?
I couldn't clearly understand why this typecasting of (float*) is needed instead of float
int i = 10;
float f = (float) i; // gives the correct op as : 10.000
But,
int i = 10;
float f = *(float*)&i; // gives a 0.0000
What is the difference between the above two type casts?
Why cant we use:
float f = (float**)&i;
float f = *(float*)&i;
In this example:
char ch = *(char*)&d;
You are not casting from double to a char. You are casting from a double* to a char*; that is, you are casting from a double pointer to a char pointer.
C will convert floating point types to integer types when casting the values, but since you are casting pointers to those values instead, there is no conversion done. You get garbage because floating point numbers are stored very differently from fixed point numbers.
Read about the representation of floating point numbers in systems. Its not the way you're expecting it to be. Cast made through (float *) in your first snippet read the most significant first 16 bits. And if your system is little endian, there will be always zeros in most significant bits if the value containing in the int type variable is lesser than 2^16.
If you need to convert int to float, the conversion is straight, because the promotion rules of C.
So, it is enough to write:
int i = 37;
float f = i;
This gives the result f == 37.0.
However, int the cast (float *)(&i), the result is an object of type "pointer to float".
In this case, the address of "pointer to integer" &i is the same as of the the "pointer to float" (float *)(&i). However, the object pointed by this last object is a float whose bits are the same as of the object i, which is an integer.
Now, the main point in this discussion is that the bit-representation of objects in memory is very different for integers and for floats.
A positive integer is represented in explicit form, as its binary mathematical expression dictates.
However, the floating point numbers have other representation, consisting of mantissa and exponent.
So, the bits of an object, when interpreted as an integer, have one meaning, but the same bits, interpreted as a float, have another very different meaning.
The better question is, why does it EVER work. You see, when you do
typedef int T;//replace with whatever
typedef double J;//replace with whatever
T s = 45;
J d = *(J*)(&s);
You are basically telling the compiler (get the T* address of s, reintepret what it points to as J, and then get that value). No casting of the value (changing the bytes) actually happens. Sometimes, by luck, this is the same (low value floats will have an exponential of 0, so the integer interpretation may be the same) but often times, this'll be garbage, or worse, if the sizes are not the same (like casting to double from char) you can read unallocated data (heap corruption (sometimes)!).
What's going on here:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %lf\n", pow(17, 12));
printf("17^13 = %lf\n", pow(17, 13));
printf("17^14 = %lf\n", pow(17, 14));
}
I get this output:
17^12 = 582622237229761.000000
17^13 = 9904578032905936.000000
17^14 = 168377826559400928.000000
13 and 14 do not match with wolfram alpa cf:
12: 582622237229761.000000
582622237229761
13: 9904578032905936.000000
9904578032905937
14: 168377826559400928.000000
168377826559400929
Moreover, it's not wrong by some strange fraction - it's wrong by exactly one!
If this is down to me reaching the limits of what pow() can do for me, is there an alternative that can calculate this? I need a function that can calculate x^y, where x^y is always less than ULLONG_MAX.
pow works with double numbers. These represent numbers of the form s * 2^e where s is a 53 bit integer. Therefore double can store all integers below 2^53, but only some integers above 2^53. In particular, it can only represent even numbers > 2^53, since for e > 0 the value is always a multiple of 2.
17^13 needs 54 bits to represent exactly, so e is set to 1 and hence the calculated value becomes even number. The correct value is odd, so it's not surprising it's off by one. Likewise, 17^14 takes 58 bits to represent. That it too is off by one is a lucky coincidence (as long as you don't apply too much number theory), it just happens to be one off from a multiple of 32, which is the granularity at which double numbers of that magnitude are rounded.
For exact integer exponentiation, you should use integers all the way. Write your own double-free exponentiation routine. Use exponentiation by squaring if y can be large, but I assume it's always less than 64, making this issue moot.
The numbers you get are too big to be represented with a double accurately. A double-precision floating-point number has essentially 53 significant binary digits and can represent all integers up to 2^53 or 9,007,199,254,740,992.
For higher numbers, the last digits get truncated and the result of your calculation is rounded to the next number that can be represented as a double. For 17^13, which is only slightly above the limit, this is the closest even number. For numbers greater than 2^54 this is the closest number that is divisible by four, and so on.
If your input arguments are non-negative integers, then you can implement your own pow.
Recursively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
if (y == 0)
return 1;
if (y == 1)
return x;
return pow(x,y/2)*pow(x,y-y/2);
}
Iteratively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y--)
res *= x;
return res;
}
Efficiently:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y > 0)
{
if (y & 1)
res *= x;
y >>= 1;
x *= x;
}
return res;
}
A small addition to other good answers: under x86 architecture there is usually available x87 80-bit extended format, which is supported by most C compilers via the long double type. This format allows to operate with integer numbers up to 2^64 without gaps.
There is analogue of pow() in <math.h> which is intended for operating with long double numbers - powl(). It should also be noticed that the format specifier for the long double values is other than for double ones - %Lf. So the correct program using the long double type looks like this:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %Lf\n", powl(17, 12));
printf("17^13 = %Lf\n", powl(17, 13));
printf("17^14 = %Lf\n", powl(17, 14));
}
As Stephen Canon noted in comments there is no guarantee that this program should give exact result.