conversion problems from long to float [duplicate] - c

I already searched for this problem, but I didn't find anything. The problem is that when I have a big int, like 16777217, and cast it to float, the resulting value is 16777216. I cannot figure out how to avoid this problem. Note that I have to use int and float, because I pass the int to the fmodf function, which automatically converts its int argument to float.

float has a limited number of bits, so it does not have infinite precision.
16777217 is the first integer value that a float (IEEE-754 binary32 type) cannot represent precisely and it is stored as 16777216.
Use double if you want more precision.

Most machines today follow ISO/IEC/IEEE 60559:2011, i.e. the identical IEEE-754 floating point standard.
The IEEE-754 single-precision float can represent integer values exactly only up to 2^24, which is exactly 16777216.
You can use double or long double instead. For instance, the IEEE-754 double-precision double can represent integer values exactly up to 2^53 (approximately 10^16).

Floating point numbers have limited precision; integers within their range are exact.
Therefore, converting from one to the other will always carry this risk.
So the solution is to decide how much accuracy you actually need and choose a type wide enough to provide it.

Related

What are the limit values of int32 till which int32 to float conversion can work without rounding to nearest value?

I am trying to convert int32 values to float, and when I try to convert values above 0x0FFFFF the last digit is always rounded to the nearest representable value. I know that when a value does not fit in the destination float it will be rounded, but I need to know the limit value at which this starts.
e.g. 111111111 (0x69F6BC7) is printed as 111111112.0 .
The maximum integer value of a float significand is FLT_RADIX/FLT_EPSILON - 1. By “integer value” of a significand, I mean the value when it is scaled so that its lowest bit represents a value of 1.
The value FLT_RADIX/FLT_EPSILON is also representable in float, since it is a power of the radix. FLT_RADIX/FLT_EPSILON + 1 is not representable in float, so converting an integer to float might result in rounding if the integer exceeds FLT_RADIX/FLT_EPSILON in magnitude.
If it is known that INT_MAX exceeds FLT_RADIX/FLT_EPSILON, you can test this for a non-negative int x with (int) (FLT_RADIX/FLT_EPSILON) < x. If it is not known that FLT_RADIX/FLT_EPSILON can be converted to int successfully, more complicated tests may be needed.
Very commonly, C implementations use the IEEE-754 binary32 format, also known as “single precision,” for float. In this format, FLT_RADIX/FLT_EPSILON is 2^24 = 16,777,216.
These symbols are defined in <float.h>. For double or long double, replace FLT_EPSILON with DBL_EPSILON or LDBL_EPSILON. FLT_RADIX remains unchanged since it is the same for all formats.
Theoretically, a perverse floating-point format might have an abnormally small exponent range that makes FLT_RADIX/FLT_EPSILON - 1 not representable because the significand cannot be scaled high enough. This can be disregarded in practice.

C - Unsigned long long to double on 32-bit machine

Hi I have two questions:
uint64_t vs double, which has a higher range limit for covering positive numbers?
How to convert double into uint64_t if only the whole number part of double is needed.
Direct casting apparently doesn't work due to how double is defined.
Sorry for any confusion, I'm talking about the 64bit double in C on a 32bit machine.
As for an example:
// conversion operation I used:
double sampleRate = (
    (union { double i; uint64_t sampleRate; })
    { .i = r23u.outputSampleRate }
).sampleRate;
// the following are printouts on the command line:
//                          double                 uint64_t
// printed by %.16llx:      0x41886a0000000000     0x41886a0000000000
// printed by %f / %llu:    51200000.000000        4722140757530509312
//                          (outputSampleRate)     (sampleRate)
So the two numbers remain the same bit pattern but when print out as decimals, the uint64_t is totally wrong.
Thank you.
uint64_t vs double, which has a higher range limit for covering positive numbers?
uint64_t, where supported, has 64 value bits, no padding bits, and no sign bit. It can represent all integers between 0 and 2^64 − 1, inclusive.
Substantially all modern C implementations represent double in IEEE-754 64-bit binary format, but C does not require nor even endorse that format. It is so common, however, that it is fairly safe to assume that format, and maybe to just put in some compile-time checks against the macros defining FP characteristics. I will assume for the balance of this answer that the C implementation indeed does use that representation.
IEEE-754 double precision provides a 53-bit significand, so it can represent all integers between 0 and 2^53 − 1 exactly. It is a floating-point format, however, with an 11-bit binary exponent. The largest number it can represent is (2^53 − 1) × 2^971, which is nearly 2^1024. In this sense, double has a much greater range than uint64_t, but the vast majority of integers between 0 and its maximum value cannot be represented exactly as doubles, including almost all of the numbers that can be represented exactly by uint64_t.
How to convert double into uint64_t if only the whole number part of double is needed
You can simply assign (conversion is implicit), or you can explicitly cast if you want to make it clear that a conversion takes place:
double my_double = 1.2345678e48;
uint64_t my_uint;
uint64_t my_other_uint;
my_uint = my_double;
my_other_uint = (uint64_t) my_double;
Any fractional part of the double's value will be truncated. The integer part will be preserved exactly if it is representable as a uint64_t; otherwise, the behavior is undefined.
The code you presented uses a union to overlay storage of a double and a uint64_t. That's not inherently wrong, but it's not a useful technique for converting between the two types. Casts are C's mechanism for all non-implicit value conversions.
double can hold substantially larger numbers than uint64_t: the positive range of an 8-byte IEEE 754 double runs from about 4.94065645841246544e-324 (the smallest subnormal) up to about 1.79769313486231570e+308. However, if you add small values in that range you are in for a surprise: at some point the precision can no longer represent an increment of 1, the addition rounds back down to the same value, and a loop steadily incremented by 1 becomes non-terminating.
This code for example:
#include <stdio.h>

int main(void)
{
    for (double i = 100000000000000000000000000000000.0;
         i < 1000000000000000000000000000000000000000000000000.0; ++i)
        printf("%lf\n", i);
    return 0;
}
gives me a constant output of 100000000000000005366162204393472.000000. That's also why we have nextafter and nexttoward functions in math.h. You can also find ceil and floor functions there, which, in theory, will allow you to solve your second problem: removing the fraction part.
However, if you really need to hold large numbers you should look at bigint implementations instead, e.g. GMP. Bigints are designed for operations on very large integers, where an operation like adding one truly increments the number, even for very large values.

Is there a case when an integer loses its precision when casted to double?

Suppose I have
int i=25;
double j=(double)i;
Is there a chance that j will have a value like 24.9999999..upto_allowed or 25.00000000..._upto_allowed_minus_one_and_then_1? I remember reading such stuff somewhere but am not able to recall it properly.
In other words:
Is there a case when an integer loses its precision when casted to double?
For small numbers like 25, you are good. For very large (absolute) int values on architectures where int is 64 bits or more (i.e. values not representable in 53 bits), you will lose precision.
A double-precision floating point number has 53 bits of significand precision, of which the most significant bit is an implicit 1 (for normal numbers).
On platforms where the floating point representation is not IEEE-754, the answer may be a little different. For more details you can refer to section 5.2.4.2.2 of the C99/C11 specs.
An IEEE-754 double has a significand precision of 53 bits. This means it can store exactly all signed integers in the range −2^53 to 2^53.
Because int typically has 32 bits on most compilers/architectures, double will usually be able to handle int.
Mohit Jain's answer is good for practical coding.
By the C spec, DBL_DIG (or FLT_RADIX and DBL_MANT_DIG) and INT_MAX/INT_MIN are the important values.
DBL_DIG is the maximum number of decimal digits a number can have such that converting it to double and back certainly yields the same value. It is at least 10, so a whole number like 9,999,999,999 can certainly convert to a double and back without losing precision. Larger values may successfully round-trip too.
The real round-trip problems begin with integer values exceeding ±power(FLT_RADIX, DBL_MANT_DIG). FLT_RADIX is the floating point base (overwhelmingly 2) and DBL_MANT_DIG is the "number of base-FLT_RADIX digits in the floating-point significand", such as 53 with IEEE-754 binary64.
Of course an int has the range [INT_MIN ... INT_MAX], which must be at least [−32,767 ... +32,767].
When, mathematically, power(FLT_RADIX, DBL_MANT_DIG) >= INT_MAX, there are no conversion problems. This applies to all conforming C compilers.

Why floating point does not start from negative numbers when it exceeds its range?

As we all know, when an integer variable exceeds its range it wraps around to the other end, that is, to negative numbers. For example:
int a=2147483648;
printf("%d",a);
OUTPUT:
-2147483648 (as I was expecting)
Now I tried the same for floating points.
for example
float a=3.4e39;//as largest float is 3.4e38
printf("%f",a);
OUTPUT:
1.#INF00 (I was expecting some negative float value)
I didn't get the above output exactly, but I know it represents positive infinity.
So my question is simply why it does not start from other end(negative values like integers)?
Floating point numbers are stored in a different format than integer numbers, and don't follow the same over-/under-flowing mechanics.
More specifically, the binary bit pattern for 2147483648 is 1000 0000 0000 0000 0000 0000 0000 0000 (a 1 followed by 31 zeros), which in a two's complement system (like the one used on almost all modern computers) is the same as −2147483648.
Most computers today use the IEEE 754 format for floating point values, and those are handled quite differently from plain integers.
In IEEE-754, the maximum finite float (binary-32) value is below double value 3.4e39.
IEEE-754 says (for default rounding-direction attribute roundTiesToEven):
(IEEE-754:2008, 4.3.1 Rounding-direction attributes to nearest) "In the following two rounding-direction attributes, an infinitely precise result with magnitude at least b^emax × (b − ½·b^(1−p)) shall round to ∞ with no change in sign; here emax and p are determined by the destination format (see 3.3)"
So in this declaration:
float a=3.4e39;
the conversion yields a positive infinity.
Under IEEE floating point, it's impossible for arithmetic to overflow because the representable range is [-INF,INF] (including the endpoints). As usual, floating point is subject to rounding when the exact value is not representable, and in your case, rounding yields INF.
Other answers have looked at floating point. This answer is about why signed integer values traditionally wrap around. It is not because that is particularly nice behavior. It is because that is what is expected because it is the way it has been done for a long time.
Especially in early hardware, with either discrete logic or very limited chip space, there was a major advantage to using the same adder for signed and unsigned integer addition and subtraction.
Floating point arithmetic was done in software except on special "scientific" computers that cost extra. Floating point numbers are always signed, and, as has been pointed out in other answers, have their own format. There is no signed/unsigned hardware sharing issue.
Common hardware for signed and unsigned integers can be achieved by using 2's complement representation for signed integer types.
What follows is based on 8 bit integers, with each bit pattern represented as 2 hexadecimal digits. Other widths work the same way.
00 through 7f have the same meaning in unsigned and 2's complement, 0 through 127 in that order, the intersection of the two ranges. 80 through ff represent 128 through 255, in that order, for unsigned integers, but represent negative numbers for signed. To make addition the same for both, 80 represents -128, and ff represents -1.
Now see what happens if you add 1 to 7f. For unsigned, it has to increment from 127 to 128. That means the resulting bit pattern is 80, which is also the most negative signed value. The price of sharing an adder is wrap-around at one point in the range.

Implicit types for numbers in C

What are the implicit types for numbers in C? If, for example, I have a decimal number in a calculation, is the decimal always treated as a double? If I have a non-decimal number, is it always treated as an int? What if my non-decimal number is larger than an int value?
I'm curious because this affects type conversion and promotion. For instance, if I have the following calculation:
float a = 1.0 / 25;
Is 1.0 treated as a double and 25 treated as an int? Is 25 then promoted to a double, the calculation performed at double precision and then the result converted to a float?
What about:
double b = 1 + 2147483649; // note that the number is larger than an int value
If the number has neither a decimal point nor an exponent, it is an integer of some sort; by default, an int.
If the number has a decimal point or an exponent, it is a floating point number of some sort; by default, a double.
That's about it. You can append suffixes to numbers (such as ULL for unsigned long long) to specify the type more precisely. Otherwise (simplifying a little), integers are the smallest int type (of type int or longer) that will hold the value.
In your examples, the code is:
float a = 1.0 / 25;
double b = 1 + 2147483649;
The value of a is calculated by noting that 1.0 is a double and 25 is an integer. When processing the division, the int is converted to a double, the calculation is performed (producing a double), and the result is then coerced into a float for assignment to a. All of this can be done by the compiler, so the result will be pre-computed.
Similarly, on a system with 32-bit int, the value 2147483649 is too big to be an int, so it will be treated as a signed type bigger than int (either long or long long); the 1 is added (yielding the same type), and then that value is converted to a double. Again, it is all done at compile time.
These computations are governed by the same rules as other computations in C.
The type rules for integer constants are detailed in §6.4.4.1 Integer constants of ISO/IEC 9899:1999. There's a table which details the types depending on the suffix (if any) and the kind of constant (decimal vs octal or hexadecimal). For decimal constants, the value is always a signed integer; for octal or hexadecimal constants, the type can be signed or unsigned as required, taking the first type in which the value fits. Thanks to Daniel Fischer for pointing out my mistake.
http://en.wikipedia.org/wiki/Type_conversion
The standard has a general guideline for what you can expect, but compilers have a superset of rules that encompass the standard as well as rules for optimizing. The above link discusses some of the generalities you can expect. If you are concerned about implicit coercion, it is typically good practice to use explicit casting.
Keep in mind that the size of the primitive types is not guaranteed.
1.0 / 25
It evaluates to a double because one of the operands is a double. If you changed it to 1/25, the division would be performed on two integers and evaluate to 0.
double b = 1 + 2147483649;
The right side is evaluated as an integer and then coerced to a double during assignment.
Actually, in your example you may get a compiler warning. You'd either write 1.0f to make it a float to start with, or explicitly cast the result before assigning it.
