Hi I have two questions:
uint64_t vs double, which has a higher range limit for covering positive numbers?
How to convert double into uint64_t if only the whole number part of double is needed.
Direct casting apparently doesn't work due to how double is defined.
Sorry for any confusion, I'm talking about the 64bit double in C on a 32bit machine.
As for an example:
//operation for conversion I used:
double sampleRate = (
(union { double i; uint64_t sampleRate; })
{ .i = r23u.outputSampleRate}
).sampleRate;
//the following are printouts on command line
// double uint64_t
//printed by %.16llx %.16llx
outputSampleRate 0x41886a0000000000 0x41886a0000000000 sampleRate
//printed by %f %llu
outputSampleRate 51200000.000000 4722140757530509312 sampleRate
So the two numbers keep the same bit pattern, but when printed out as decimals, the uint64_t is totally wrong.
Thank you.
uint64_t vs double, which has a higher range limit for covering positive numbers?
uint64_t, where supported, has 64 value bits, no padding bits, and no sign bit. It can represent all integers between 0 and 2^64 - 1, inclusive.
Substantially all modern C implementations represent double in IEEE-754 64-bit binary format, but C does not require nor even endorse that format. It is so common, however, that it is fairly safe to assume that format, and maybe to just put in some compile-time checks against the macros defining FP characteristics. I will assume for the balance of this answer that the C implementation indeed does use that representation.
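The compile-time checks mentioned above could look roughly like this. It is only a sketch: it assumes a C11 compiler (for _Static_assert) and tests the <float.h> characteristics that binary64 implies, which is not the same as a full guarantee of IEEE-754 behavior.

```c
#include <float.h>
#include <limits.h>

/* Stop the build if double does not have the shape of IEEE-754 binary64.
   These checks cover the format parameters only, not rounding behavior. */
_Static_assert(FLT_RADIX == 2, "floating point is not binary");
_Static_assert(DBL_MANT_DIG == 53, "double does not have a 53-bit significand");
_Static_assert(DBL_MAX_EXP == 1024, "double exponent range is not binary64 (max)");
_Static_assert(DBL_MIN_EXP == -1021, "double exponent range is not binary64 (min)");
_Static_assert(sizeof(double) * CHAR_BIT == 64, "double is not 64 bits wide");
```

If any assertion fails, the translation unit does not compile, so code that relies on the format never runs on a platform where the assumption is false.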
IEEE-754 binary double precision provides a 53-bit significand, so it can represent every integer between 0 and 2^53. It is a floating-point format, however, with an 11-bit binary exponent. The largest finite number it can represent is (2^53 - 1) * 2^971, or nearly 2^1024. In this sense, double has a much greater range than uint64_t, but the vast majority of integers between 0 and its maximum value cannot be represented exactly as doubles, including almost all of the numbers that can be represented exactly by uint64_t.
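To see where exactness ends, a small round-trip test helps. This is a sketch assuming IEEE-754 binary64 doubles; the function name is made up for illustration.

```c
#include <stdint.h>

/* Returns 1 if v is unchanged after converting to double and back.
   Hypothetical helper; assumes double is IEEE-754 binary64. */
static int survives_double_round_trip(uint64_t v)
{
    double d = (double)v;

    /* Values near UINT64_MAX round up to 2^64, which is out of range for
       uint64_t, so converting back would be undefined. Reject them first. */
    if (d >= 18446744073709551616.0)   /* 2^64, exactly representable */
        return 0;

    return (uint64_t)d == v;
}
```

With binary64, 2^53 (9007199254740992) survives the round trip, while 2^53 + 1 does not, because it rounds to 2^53 on the way in.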
How to convert double into uint64_t if only the whole number part of double is needed
You can simply assign (conversion is implicit), or you can explicitly cast if you want to make it clear that a conversion takes place:
double my_double = 1.2345678e18; /* within uint64_t range; a value such as 1e48 would make the conversion undefined */
uint64_t my_uint;
uint64_t my_other_uint;
my_uint = my_double;
my_other_uint = (uint64_t) my_double;
Any fractional part of the double's value will be truncated. The integer part will be preserved exactly if it is representable as a uint64_t; otherwise, the behavior is undefined.
The code you presented uses a union to overlay storage of a double and a uint64_t. That's not inherently wrong, but it's not a useful technique for converting between the two types. Casts are C's mechanism for all non-implicit value conversions.
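The difference between converting the value and reinterpreting the bits can be put side by side. The following is a sketch with hypothetical helper names, assuming a 64-bit IEEE-754 double; it adds a range check because the cast itself is undefined for out-of-range values.

```c
#include <stdint.h>
#include <string.h>

/* Value conversion: truncates toward zero.
   Returns 0 on success, -1 if d is NaN or outside the range of uint64_t. */
static int double_to_u64(double d, uint64_t *out)
{
    /* 2^64 is exactly representable as a double; NaN fails both comparisons. */
    if (!(d > -1.0 && d < 18446744073709551616.0))
        return -1;
    *out = (uint64_t)d;
    return 0;
}

/* Reinterpretation: copies the object representation, which is what the
   union in the question does. Assumes sizeof(double) == 8. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

For 51200000.0, double_to_u64 stores 51200000, while double_bits returns 0x41886a0000000000, which is the 4722140757530509312 printed in the question.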
double can hold substantially larger numbers than uint64_t: the positive range of an 8-byte IEEE 754 value runs from 4.94065645841246544e-324 to 1.79769313486231570e+308, and the same magnitudes are available with a negative sign [taken from here][more detailed explanation]. However, if you add small values to numbers in that range you will be in for a surprise: at some point the precision can no longer represent an addition of 1, so the sum rounds back down to the original value, and a loop that increments by 1 never terminates.
This code for example:
#include <stdio.h>
int main()
{
    for (double i = 100000000000000000000000000000000.0; i < 1000000000000000000000000000000000000000000000000.0; ++i)
        printf("%lf\n", i);
    return 0;
}
gives me a constant output of 100000000000000005366162204393472.000000. That's also why we have nextafter and nexttoward functions in math.h. You can also find ceil and floor functions there, which, in theory, will allow you to solve your second problem: removing the fraction part.
However, if you really need to hold large numbers you should look at bigint implementations instead, e.g. GMP. Bigints were designed to do operations on very large integers, and operations like an addition of one will truly increment the number even for very large values.
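The stalled loop comes from the point where adding 1 no longer changes a double. A rough way to find that point, assuming IEEE-754 binary64 and round-to-nearest (the function name is made up for this sketch):

```c
/* Returns the smallest power of two x for which x + 1.0 == x.
   volatile is used to discourage evaluation in extended precision. */
static double smallest_power_of_two_absorbing_one(void)
{
    volatile double x = 1.0;

    for (;;) {
        volatile double next = x + 1.0;
        if (next == x)
            return x;
        x = x * 2.0;
    }
}
```

Under those assumptions the function returns 2^53 = 9007199254740992. The loop in the example starts far above that value, so ++i never changes i.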
Related
#include <stdio.h>
int main() {
unsigned long long int c = 9999999999999999999U / 2;
unsigned long long int d = 9999999999999999999U * 0.5;
unsigned long long int e = 9999999999999999999U >> 1;
printf("%llu\n%llu\n%llu\n", c, d, e);
return 0;
}
So the output of that is:
4999999999999999999
5000000000000000000
4999999999999999999
Why is there a difference when multiplied by 0.5?
and why doesn't this difference show up when the numbers are small?
In the case of d, 9999999999999999999 is converted to a double. If your C implementation uses IEEE 754 doubles, the converted value is 10000000000000000000, because a double has only 53 bits in its significand, one of which is an implied 1. Multiplying 10000000000000000000 by 0.5 gives 5000000000000000000. Floating point is weird. Read up on it at https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html.
9999999999999999999U is a large number: it requires 64 bits to represent in binary. Type unsigned long long int is guaranteed by the C Standard to have at least 64 value bits, so, depending on the range of the smaller integer types, the constant has type unsigned int, unsigned long int or at most unsigned long long int.
The expressions 9999999999999999999U / 2 and 9999999999999999999U >> 1 are thus fully defined and evaluate to 4999999999999999999, typically at compile time through constant folding, with the same type. This value can be stored into c and e and output correctly by printf with a format %llu as expected.
Conversely, 9999999999999999999U * 0.5 (or similarly 9999999999999999999U / 2.0) is evaluated as a floating-point expression: (double)9999999999999999999U * 0.5. The floating-point result of type double is converted to an unsigned long long int when assigned to d.
The double type is only guaranteed to provide enough precision for converting numbers up to 10 decimal digits without loss, a lot less than required for your number. Most C implementations use IEEE-754 representation for the double type that has exactly 53 bits of precision. The value 9999999999999999999 is thus rounded as 1E19 when converted to a double. Multiplying by 0.5 or dividing by 2.0 is performed exactly as it only changes the binary exponent part. The result 5E18 is converted to unsigned long long int and printed as 5000000000000000000 as you see on your system.
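The two evaluation paths can be isolated in a pair of small functions. This is a sketch with hypothetical names, assuming IEEE-754 binary64 doubles:

```c
/* Integer path: no conversion, exact for every input. */
static unsigned long long half_via_integer(unsigned long long n)
{
    return n / 2;
}

/* Floating-point path: n is first rounded to the nearest double,
   then halved (exactly), then truncated back to an integer. */
static unsigned long long half_via_double(unsigned long long n)
{
    return (unsigned long long)((double)n * 0.5);
}
```

For 9999999999999999999 the two functions disagree, as in the question. For small inputs such as 1000001 both return 500000, because every integer below 2^53 converts to double exactly.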
The differences are explained with type propagation.
First example, dividing integer by integer. Dividing by two and right-shifting are equivalent here, they are done on the operands as-is.
Second example, multiplying an integer by a double. Here, the compiler first converts the integer operand to a double (which only guarantees ten decimal digits, I think) and then performs the multiplication. To store the result in an integer again, it is truncated.
I hope that illustrates that different operations are going on, caused by the different types of the operands, even though they seem to be similar from a mathematical point of view.
Is a conversion from an int to a float always possible in C without the float becoming one of the special values like +Inf or -Inf?
AFAIK there is no upper limit on the range of int.
I think a 128 bit int would cause an issue for a platform with an IEEE754 float as that has an upper value of around the 127th power of 2.
Short answer to your question: no, it is not always possible.
But it is worthwhile to go a little bit more into details. The following paragraph shows what the standard says about integer to floating-point conversions (online C11 standard draft):
6.3.1.4 Real floating and integer
2) When a value of integer type is converted to a real floating type,
if the value being converted can be represented exactly in the new
type, it is unchanged. If the value being converted is in the range of
values that can be represented but cannot be represented exactly, the
result is either the nearest higher or nearest lower representable
value, chosen in an implementation-defined manner. If the value being
converted is outside the range of values that can be represented, the
behavior is undefined. ...
So many integer values can be converted exactly. Some integer values may lose precision, yet a conversion is at least possible. For some values, however, the behaviour might be undefined (if, for example, an integer value cannot be represented even with the maximum exponent of the float type). But I cannot actually think of a case where this will happen.
Is it always possible to convert an int to a float?
Reasonably - yes. An int will always convert to a finite float. The conversion may lose some precision for great int values.
Yet for the pedantic, an odd compiler could have trouble.
C allows for an excessively wide int, not just 16, 32 or 64 bits, and float could have a limited range, as small as 1e37.
It is not the upper range of int, INT_MAX, that should be of concern. It is the lower end: INT_MIN often has a magnitude 1 greater than INT_MAX.
A 124-bit int minimum value could be about -1.06e37, so that does exceed the minimal float range.
With the common binary32 float, an int would need to be more than 128 bits to cause a float infinity.
So what test is needed to detect this rare situation?
Form an exact power-of-2 limit and perform careful math to avoid overflow or imprecision.
#include <float.h>
#include <limits.h>

#if -INT_MAX == INT_MIN
// rare non 2's complement machine
#define INT_MAX_P1_HALF (INT_MAX/2 + 1)
_Static_assert(FLT_MAX/2 >= INT_MAX_P1_HALF, "non-2's comp.`int` range exceeds `float`");
#else
_Static_assert(-FLT_MAX <= INT_MIN, "2's complement `int` range exceeds `float`");
#endif
The standard only requires floating point representations to include a finite number as large as 10^37 (§5.2.4.2.2/12) and does not put any limit on the maximum size of an integer. So if your implementation has 128-bit integers (or even 124-bit integers), it is possible for an integer-to-float conversion to exceed the range of finite representable floating point numbers.
No, it is not always possible to convert an int to a float exactly, due to how floats work. 32-bit floats greater than 16777216 (or less than -16777216) need to be even, those greater than 33554432 (or less than -33554432) need to be evenly divisible by 4, those greater than 67108864 (or less than -67108864) need to be evenly divisible by 8, etc. The IEEE-754 float standard defines round to nearest even as the default mode, but other modes exist depending upon the implementation.
Also, the largest 128 bit int = 2^128 - 1 is greater than the largest 32 bit float = 2^127 x 1.11111111111111111111111 = 2^127 x (2-2^-23) = 2^127 x (2^1-2^-23) = 2^(127+1) - 2^(127-23) = 2^(127+1)-2^(127-23) = 2^(128) - 2^(104)
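Both points can be checked with a few lines. This is a sketch assuming a 32-bit two's complement int and an IEEE-754 binary32 float; the function names are invented for the example.

```c
#include <float.h>
#include <limits.h>

/* Returns 1 if v is unchanged by a round trip through float. */
static int survives_float_round_trip(int v)
{
    volatile float f = (float)v;   /* volatile: force rounding to float precision */

    /* Values near INT_MAX round up to 2^31; converting that back to int
       would be undefined, so report it as "not preserved". */
    if (f >= -(float)INT_MIN)
        return 0;

    return (int)f == v;
}

/* Returns 1 if FLT_MAX equals 2^128 - 2^104, as derived above. */
static int flt_max_matches_derivation(void)
{
    return (double)FLT_MAX == 0x1p128 - 0x1p104;
}
```

With those assumptions, 16777216 and 16777218 survive the round trip and 16777217 does not, matching the "must be even" rule above 2^24.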
Suppose I have
int i=25;
double j=(double)i;
Is there a chance that j will have a value like 24.9999999..upto_allowed or 25.00000000..._upto_allowed_minus_one_and_then_1? I remember reading such stuff somewhere but am not able to recall it properly.
In other words:
Is there a case when an integer loses its precision when cast to double?
For small numbers like 25, you are good. For very large (absolute) values of int on an architecture where int is 64 bits or wider (values not representable in 53 bits), you will lose precision.
A double-precision floating-point number has 53 bits of precision, of which the most significant bit is usually an implicit 1.
On platforms where the floating-point representation is not IEEE-754, the answer may be a little different. For more details, refer to section 5.2.4.2.2 of the C99/C11 specs.
An IEEE-754 double has a significand precision of 53-bits. This means it can store all signed integers within the range 2^53 and -2^53.
Because int typically has 32 bits on most compilers/architectures, double will usually be able to handle int.
@Mohit Jain's answer is good for practical coding.
By the C spec, DBL_DIG or FLT_RADIX/DBL_MANT_DIG and INT_MAX/INT_MIN are important values.
DBL_DIG is the maximum number of decimal digits a number can have such that, when converted to double and back, it will certainly have the same value. It is at least 10. So a whole number like 9,999,999,999 can certainly convert to a double and back without losing precision. Possibly larger values can successfully round-trip too.
The real round-trip problems begin with integer values exceeding +/-power(FLT_RADIX, DBL_MANT_DIG). FLT_RADIX is the floating point base (and is overwhelmingly 2) and DBL_MANT_DIG is the "number of base-FLT_RADIX digits in the floating-point significand", such as 53 with IEEE-754 binary64.
Of course an int has the range [INT_MIN ... INT_MAX]. The range must be at least [-32,767 ... +32,767].
When, mathematically, power(FLT_RADIX, DBL_MANT_DIG) >= INT_MAX, there are no conversion problems. This applies to all conforming C compilers.
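That comparison can be written without computing the power itself, by counting the value bits of INT_MAX instead. This is a sketch that assumes FLT_RADIX == 2; the function name is hypothetical.

```c
#include <float.h>
#include <limits.h>

/* Returns 1 if every int value converts to double exactly.
   Assumes a binary floating-point format (FLT_RADIX == 2). */
static int every_int_is_exact_in_double(void)
{
    int value_bits = 0;

    /* Count the bits needed for INT_MAX, e.g. 31 for a 32-bit int. */
    for (unsigned int m = INT_MAX; m != 0; m >>= 1)
        value_bits++;

    /* INT_MIN is a power of two in two's complement, so it is exact
       whenever INT_MAX fits in the significand. */
    return FLT_RADIX == 2 && value_bits <= DBL_MANT_DIG;
}
```

On the common combination of a 32-bit int and a binary64 double this returns 1, so the cast of 25 in the question is exact.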
I recently wrote a block of code that takes as an input an 8 digit hexadecimal number from the user, transforms it into an integer and then converts it into a float. To go from integer to float I use the following:
int myInt;
float myFloat;
myFloat = *(float *)&myInt;
printf("%g", myFloat);
It works perfectly for small numbers. But when the user inputs hexadecimal numbers such as:
0x0000ffff
0x7eeeeeef
I get that myInt = -2147483648 and that myFloat = -0. I know that the number I get for myInt is the smallest possible number that can be stored in an int variable in C.
Because of this problem, the input range of my program is extremely limited. Does anyone have any advice on how I could expand the range of my code so that it could handle a number as big as:
0xffffffff
Thank you so much for any help you may give me!
The correct way to get the value transferred as accurately as float will allow is:
float myFloat = myInt;
If you want better accuracy, use double instead of float.
What you're doing is trying to reinterpret the bit pattern of the int as if it were a float, which is not a good idea. There are hexadecimal floating-point constants and conversions available in C99 and later. (However, if that is what you are trying to do, your code in the question is correct; your problem appears to be in converting hex to integer.)
If you get -2147483648 from 0x0000FFFF (or from 0x7EEEFFFF), there is a bug in your conversion code. Fix that before doing anything else. How are you doing the hex to integer conversion? Using strtol() is probably a good way (sscanf() and friends are also possible), but be careful about overflows.
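One way the hex parsing might be done with overflow checks is sketched below. It uses strtoul() rather than strtol() so that inputs up to 0xffffffff fit, and memcpy for the bit reinterpretation. The helper names are made up, and the second function assumes a 32-bit IEEE-754 float.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parses a hex string such as "0x7eeeeeef" into 32 bits.
   Returns 0 on success, -1 on malformed or too-large input. */
static int parse_hex32(const char *text, uint32_t *out)
{
    char *end;
    unsigned long v;

    errno = 0;
    v = strtoul(text, &end, 16);           /* base 16 accepts an optional 0x prefix */
    if (end == text || *end != '\0')       /* no digits, or trailing junk */
        return -1;
    if (errno == ERANGE || v > 0xFFFFFFFFul)
        return -1;

    *out = (uint32_t)v;
    return 0;
}

/* Reinterprets 32 bits as a float without pointer casts. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

If the numeric value is wanted instead of the bit pattern, a plain conversion such as (float)bits does that, as described above.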
Does anyone have any advice on how I could expand the range of my code so that it could
handle a number as big as 0xffffffff
You can't store 0xffffffff in a 32-bit int; the largest positive hex value you can store in a 32-bit int is 0x7FFFFFFF, or (2^31 - 1), or 2147483647, while the negative range extends to -2^31, or -2147483648.
The ranges are due to obvious limitations in the number of bits available and the 2's complement system.
Use an unsigned int if you want 0xffffffff.
I have this structure which I want to write to a file:
typedef struct
{
char* egg;
unsigned long sausage;
long bacon;
double spam;
} order;
This file must be binary and must be readable by any machine that has a
C99 compiler.
I looked at various approaches to this matter such as ASN.1, XDR, XML,
ProtocolBuffers and many others, but none of them fit my requirements:
small
simple
written in C
I decided then to make my own data protocol. I could handle the
following representations of integer types:
unsigned
signed in one's complement
signed in two's complement
signed in sign and magnitude
in a valid, simple and clean way (impressive, no?). However, the
real types are being a pain now.
How should I read float and double from a byte stream? The standard
says that bitwise operators (at least &, |, << and >>) are for
integer types only, which left me without hope. The only way I could
think was:
int sign;
int exponent;
unsigned long mantissa;
order my_order;
sign = read_sign();
exponent = read_exponent();
mantissa = read_mantissa();
my_order.spam = sign * mantissa * pow(10, exponent);
but that doesn't seem really efficient. I also could not find a
description of the representation of double and float. How should
one proceed before this?
If you want to be as portable as possible with floats you can use frexp and ldexp:
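#include <limits.h>
#include <math.h>

/* WriteInt, WriteUnsigned, ReadInt and ReadUnsigned are assumed to exist elsewhere. */
void WriteFloat (float number)
{
    int exponent;
    unsigned long mantissa;
    /* frexp returns a fraction in [0.5, 1) for positive input. Negative input
       would need its sign stored separately before this cast. */
    mantissa = (unsigned long) (INT_MAX * frexp(number, &exponent));
    WriteInt (exponent);
    WriteUnsigned (mantissa);
}
float ReadFloat ()
{
    int exponent = ReadInt();
    unsigned long mantissa = ReadUnsigned();
    float value = (float)mantissa / INT_MAX;
    return ldexp (value, exponent);
}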
The idea behind this is that ldexp, frexp and INT_MAX are standard C. Also, the precision of an unsigned long is usually at least as high as the width of the mantissa (no guarantee, but it is a valid assumption and I don't know a single architecture that is different here).
Therefore the conversion works without precision loss. The division/multiplication with INT_MAX may lose a bit of precision during conversion, but that's a compromise one can live with.
If you are using C99 you can output real numbers in portable hex using %a.
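A minimal sketch of the %a approach, assuming a C99 library and the default "C" locale. The helper names are invented for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes v into buf as a hexadecimal floating-point string.
   Returns 0 on success, -1 if buf is too small. */
static int format_hex_double(double v, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%a", v);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Reads the string back; strtod accepts hexadecimal floating-point input. */
static double parse_hex_double(const char *text)
{
    return strtod(text, NULL);
}
```

Because the hex form spells out the binary significand and exponent, a finite value read back with strtod compares equal to the one written. The result is text, though, not the binary file the question asks for.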
If you are using IEEE-754, why not access the float or double as an unsigned short or unsigned long and save the floating point data as a series of bytes, then re-convert the "specialized" unsigned short or unsigned long back to a float or double on the other side of the transmission? The bit data would be preserved, so you should end up with the same floating point number after transmission.
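A sketch of that byte-level approach for double, with hypothetical function names. It assumes both sides use IEEE-754 binary64 and that double and uint64_t share the same byte order in memory; the bytes are written most significant first so that the file layout does not depend on the host.

```c
#include <stdint.h>
#include <string.h>

/* Stores the bit pattern of v as 8 bytes, most significant byte first. */
static void put_double_be(unsigned char out[8], double v)
{
    uint64_t bits;
    int i;

    memcpy(&bits, &v, sizeof bits);
    for (i = 0; i < 8; i++)
        out[i] = (unsigned char)(bits >> (56 - 8 * i));
}

/* Rebuilds the double from 8 bytes written by put_double_be. */
static double get_double_be(const unsigned char in[8])
{
    uint64_t bits = 0;
    double v;
    int i;

    for (i = 0; i < 8; i++)
        bits = (bits << 8) | in[i];
    memcpy(&v, &bits, sizeof v);
    return v;
}
```

Shifting the integer instead of copying raw bytes is what removes the endianness dependence; the format dependence on IEEE-754 remains.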
This answer uses Nils Pipenbrinck's method but I have changed a few details that I think help to ensure real C99 portability. This solution lives in an imaginary context where encode_int64 and encode_int32 etc already exist.
#include <stdint.h>
#include <math.h>
#define PORTABLE_INTLEAST64_MAX ((int_least64_t)9223372036854775807) /* 2^63-1*/
/* NOTE: +-inf and nan not handled. quickest solution
* is to encode 0 for !isfinite(val) */
void encode_double(struct encoder *rec, double val) {
int exp = 0;
double norm = frexp(val, &exp);
int_least64_t scale = norm*PORTABLE_INTLEAST64_MAX;
encode_int64(rec, scale);
encode_int32(rec, exp);
}
void decode_double(struct encoder *rec, double *val) {
int_least64_t scale = 0;
int_least32_t exp = 0;
decode_int64(rec, &scale);
decode_int32(rec, &exp);
*val = ldexp((double)scale/PORTABLE_INTLEAST64_MAX, exp);
}
This is still not a real solution, inf and nan can not be encoded. Also notice that both parts of the double carry sign bits.
int_least64_t is guaranteed by the standard (int64_t is not), and we use the least permissible maximum for this type to scale the double. The encoding routines accept int_least64_t but will have to reject input that is larger than 64 bits for portability; the same goes for the 32-bit case.
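A variation on the same frexp/ldexp idea scales by a power of two instead of by the integer maximum, which keeps the significand exact. This is only a sketch with made-up function names, assuming FLT_RADIX == 2 and DBL_MANT_DIG <= 63. Like the code above, it does not handle inf or nan.

```c
#include <float.h>
#include <math.h>
#include <stdint.h>

/* Splits a finite double so that v == sig * 2^exp2, with sig an integer. */
static void split_double(double v, int_least64_t *sig, int *exp2)
{
    int e;
    double frac = frexp(v, &e);   /* v == frac * 2^e, |frac| in [0.5, 1) or 0 */

    /* frac has at most DBL_MANT_DIG significant bits, so scaling it by
       2^DBL_MANT_DIG gives an exact integer. */
    *sig = (int_least64_t)ldexp(frac, DBL_MANT_DIG);
    *exp2 = e - DBL_MANT_DIG;
}

/* Rebuilds the double from the two integers. */
static double join_double(int_least64_t sig, int exp2)
{
    return ldexp((double)sig, exp2);
}
```

The two integers can then go through whatever integer encoding is already in place, and the sign travels with sig.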
The C standard doesn't define a representation for floating point types. Your best bet would be to convert them to IEEE-754 format and store them that way. Portability of binary serialization of double/float type in C++ may help you there.
Note that the C standard also doesn't specify a format for integers. While most computers you're likely to encounter will use a normal two's-complement representation with only endianness to be concerned about, it's also possible they would use a one's-complement or sign-magnitude representation, and both signed and unsigned ints may contain padding bits that don't contribute to the value.