Which numbers can a double type contain? (in C language)
I was trying to find the range of numbers that a double can contain in C.
I know that a float can contain numbers between -10^38
In C, a double is usually an IEEE 754 double (binary64). The standard does not require this, but these days it would be unusual for it to be something else. Here's some info on double precision formats, particularly IEEE: double-precision floating point formats
Related
I am trying to understand what is the difference between the following:
printf("%f",4.567f);
printf("%f",4.567);
How does using the f suffix change/influence the output?
The f at the end of a floating point constant determines the type and can affect the value.
4.567 is a floating point constant with the type and precision of double. A double can typically represent about 2^64 different values exactly. 4.567 is not one of them*1. The closest alternatives are typically exactly
4.56700000000000017053025658242404460906982421875 // best
4.56699999999999928235183688229881227016448974609375 // next best double
4.567f is a floating point constant with the type and precision of float. A float can typically represent about 2^32 different values exactly. 4.567 is not one of them. The closest alternatives are typically exactly
4.566999912261962890625 // best
4.56700038909912109375 // next best float
When passed to printf() as part of the ... arguments, a float is converted to double with the same value.
So the question becomes what is the expected difference in printing?
printf("%f",4.56700000000000017053025658242404460906982421875);
printf("%f",4.566999912261962890625);
Since the default number of digits after the decimal point to print for "%f" is 6, the output for both rounds to:
4.567000
To see a difference, print with more precision or try 4.567e10, 4.567e10f.
45670000000.000000 // double
45669998592.000000 // float
Your output may differ slightly due to quality-of-implementation issues.
*1 C supports many floating point encodings. A common one is binary64. Thus typical floating-point values are encoded as a sign * binary fraction * 2^exponent. Even simple decimal values like 0.1 cannot be represented exactly as such.
I'm having trouble understanding why this code's output is 2147483648:
#include <stdio.h>
int main (void){
float f = 2147483638;
printf("%f",f);
}
I tried to find an explanation using the IEEE 754 standard for float representation, but by my calculations the output should be 2147483520, not 2147483648.
Thanks for help!
That is the way that float works on your system.
Note that the C standard is intentionally flexible as to the type and sizes of the floating point types. A float does not have to be an IEEE754 32 bit floating point type.
I'm changing a uint32_t to a float but without changing the actual bits.
Just to be sure: I don't want to cast it. So float f = (float) i is the exact opposite of what I want to do because it changes bits.
I'm going to use this to convert my (pseudo) random numbers to float without doing unneeded math.
What I'm currently doing and what is already working is this:
float random_float( uint64_t seed ) {
// Generate random and change bit format to ieee
uint32_t asInt = (random_int( seed ) & 0x7FFFFF) | (0x7E000000>>1);
// Make it a float
return *(float*)(void*)&asInt; // <-- pretty ugly and needs a variable
}
The Question: Now I'd like to get rid of the asInt variable, and I'd like to know if there is a better / less ugly way than taking the address of this variable, casting it twice and dereferencing it again?
You could try a union, as long as you make sure the types have identical sizes in memory:
union convertor {
int asInt;
float asFloat;
};
Then you can assign your integer to asInt and read the same bits back through asFloat (or the other way around if you want to). I use it a lot when I need to do bitwise operations on the one hand and still get a uint32_t representation of the number on the other.
[EDIT]
As many of the commenters rightly point out, you must take into account float values that have no integer counterpart, such as NAN, +INF, -INF, +0, -0.
So you seem to want to generate floating point numbers between 0.5 and 1.0 judging from your code.
Assuming that your microcontroller has a standard C library with floating point support, you can do all of this in a standards-compliant way without actually involving any floating point operations. All you need is the ldexp function, which itself doesn't actually do any floating point math.
This would look something like this:
return ldexpf((1 << 23) + random_thing_smaller_than_23_bits(), -24);
The trick here is that we happen to know that IEEE754 binary32 floating point numbers have integer precision between 2^23 and 2^24 (I could be off-by-one here, double check please, I'm translating this from some work I've done on doubles). So the compiler should know how to convert that number to a float trivially. Then ldexp multiplies that number by 2^-24 by just changing the bits in the exponent. No actual floating point operations involved and no undefined behavior, the code is fully portable to any standard C implementation with IEEE754 numbers. Double check the generated code, but a good compiler and C library should not use any floating point instructions here.
If you want to peek at some experiments I've done around generating random floating point numbers you can peek at this github repo. It's all about doubles, but should be trivially translatable to floats.
Reinterpreting the binary representation of an int as a float would result in major problems:
There are a lot of undefined codes in the binary representation of a float.
Other codes represent special conditions, like NAN, +INF, -INF, +0, -0 (sic!), etc.
Also, if that is a random value, even if catching all non-value representations, that would yield a very bad random distribution.
If you are working on an MCU without an FPU, you should consider avoiding float altogether. An alternative might be fractional or scaled integers. There are many implementations of algorithms which use float, but can be easily converted to fixed point types with acceptable loss of precision (or even none at all). Some might even yield more precision than float (note that single precision float has only 23 explicitly stored mantissa bits, while an int32 has 31 value bits, plus 1 sign bit for either; the same holds for a fractional or fixed scaled int).
Note that the Embedded C technical report (ISO/IEC TR 18037) specifies optional fixed-point types such as _Fract, which some compilers support. You might want to research that.
Edit:
According to your comments, you seem to convert the int to a float in the range 0..<1. For that, you can assemble the float using bit operations on a uint32_t (e.g. the original value). You just need to follow the IEEE format (presuming your toolchain actually uses it; the C standard does not require it). See Wikipedia.
The result (still uint32_t) can then be reinterpreted by a union or pointer as described by others already. Pack that in a system-dependent, well-commented library and bury it deep. Do not forget to check endianness and alignment (likely both the same for float and uint32_t, but important for the bit-ops).
I need to deal with very large matrices and/or large numbers and I don't know why
double result = 2251.000000 * 9488.000000 + 7887.000000 * 8397.000000;
gives me the correct output of 87584627.000000.
Same with int result.
However, if I use float result = 2251.000000f + ... etc,
it gives me 87584624.000000 and I have no idea why!
Can somebody tell me what I'm missing?
The most common format for floating point numbers in C is the IEEE-754 format, described in this wikipedia article. The binary32 format corresponds to a float, and the binary64 format corresponds to a double.
A float has just over 7 decimal digits of precision. Since the answer to your equation has 8 significant digits, it is not guaranteed to be exactly representable as a float, and in this case it is not.
A double has almost 16 decimal digits of precision, and therefore does have an exact representation of the answer. Therefore, in general, when you are doing general purpose mathematics, you should be using doubles. However, it's important to note that even a double may not have enough precision for every application. For example, the national debt of the United States is 18,149,752,816,959.61 which barely fits into a double.
I have read many times that in the C programming language any floating-point constant is of type double by default, not float. So when we write
float PI = 3.14;
then PI will have the value 3.14-some_small_value. This is due to precision, because an 8-byte value is now assigned to a 4-byte variable. Can anyone please explain how this happens in terms of memory, or how these values are changed internally?
In this case the change you observe does not have much to do with conversion from 8-byte format to 4-byte format. Your PI will be different from 3.14 regardless of whether you declare it as float or double. Value 3.14 is impossible to represent precisely in binary floating-point format. 4, 8 or 1234 bytes is still not enough to build a precise binary representation of 3.14, since in traditional binary floating-point format this representation is infinitely long.
The exact binary representation of 3.14 is
11.00100011110101110000101000111101011100001... = 11.0(01000111101011100001)
meaning that the 01000111101011100001 part is repeated again and again indefinitely. When you use float the whole representation is truncated (or rounded) to the capacity of float, while in case of double it is truncated to the capacity of double. For this reason, any floating-point type will represent 3.14 only approximately. double will be more precise than float, but still not absolutely accurate.
This rounding is exactly what turns 3.14 into 3.1400001049041748046875 in case of float and into 3.140000000000000124344978758017532527446747 in case of double (results from GCC compiler). Your compiler might use a different rounding strategy, resulting in slightly different values.