Number of bits assigned for double data type - c

How many bits out of 64 is assigned to integer part and fractional part in double. Or is there any rule to specify it?

Note: I know I already replied with a comment. This is for my own benefit as much as the OPs; I always learn something new when I try to explain it.
Floating-point values (regardless of precision) are represented as follows:
sign * significand * βexp
where sign is 1 or -1, β is the base, exp is an integer exponent, and significand is a fraction. In this case, β is 2. For example, the real value 3.0 can be represented as 1.102 * 21, or 0.112 * 22, or even 0.0112 * 23.
Remember that a binary number is a sum of powers of 2, with powers decreasing from the left. For example, 1012 is equivalent to 1 * 22 + 0 * 21 + 1 * 20, which gives us the value 5. You can extend that past the radix point by using negative powers of 2, so 101.112 is equivalent to
1 * 22 + 0 * 21 + 1 * 20 + 1 * 2-1 + 1 * 2-2
which gives us the decimal value 5.75. A floating-point number is normalized such that there's a single non-zero digit prior to the radix point, so instead of writing 5.75 as 101.112, we'd write it as 1.01112 * 22
How is this encoded in a 32-bit or 64-bit binary format? The exact format depends on the platform; most modern platforms use the IEEE-754 specification (which also specifies the algorithms for floating-point arithmetic, as well as special values as infinity and Not A Number (NaN)), however some older platforms may use their own proprietary format (such as VAX G and H extended-precision floats). I think x86 also has a proprietary 80-bit format for intermediate calculations.
The general layout looks something like the following:
seeeeeeee...ffffffff....
where s represents the sign bit, e represents bits devoted to the exponent, and f represents bits devoted to the significand or fraction. The IEEE-754 32-bit single-precision layout is
seeeeeeeefffffffffffffffffffffff
This gives us an 8-bit exponent (which can represent the values -126 through 127) and a 22-bit significand (giving us roughly 6 to 7 significant decimal digits). A 0 in the sign bit represents a positive value, 1 represents negative. The exponent is encoded such that 000000012 represents -126, 011111112 represents 0, and 111111102 represents 127 (000000002 is reserved for representing 0 and "denormalized" numbers, while 111111112 is reserved for representing infinity and NaN). This format also assumes a hidden leading fraction bit that's always set to 1. Thus, our value 5.75, which we represent as 1.01112 * 22, would be encoded in a 32-bit single-precision float as
01000000101110000000000000000000
|| || |
|| |+----------+----------+
|| | |
|+--+---+ +------------ significand (1.0111, hidden leading bit)
| |
| +---------------------------- exponent (2)
+-------------------------------- sign (0, positive)
The IEEE-754 double-precision float uses 11 bits for the exponent (-1022 through 1023) and 52 bits for the significand. I'm not going to bother writing that out (this post is turning into a novel as it is).
Floating-point numbers have a greater range than integers because of the exponent; the exponent 127 only takes 8 bits to encode, but 2127 represents a 38-digit decimal number. The more bits in the exponent, the greater the range of values that can be represented. The precision (the number of significant digits) is determined by the number of bits in the significand. The more bits in the significand, the more significant digits you can represent.
Most real values cannot be represented exactly as a floating-point number; you cannot squeeze an infinite number of values into a finite number of bits. Thus, there are gaps between representable floating point values, and most values will be approximations. To illustrate the problem, let's look at an 8-bit "quarter-precision" format:
seeeefff
This gives us an exponent between -7 and 8 (we're not going to worry about special values like infinity and NaN) and a 3-bit significand with a hidden leading bit. The larger our exponent gets, the wider the gap between representable values gets. Here's a table showing the issue. The left column is the significand; each additional column shows the values we can represent for the given exponent:
sig -1 0 1 2 3 4 5
--- ---- ----- ----- ----- ----- ----- ----
000 0.5 1 2 4 8 16 32
001 0.5625 1.125 2.25 4.5 9 18 36
010 0.625 1.25 2.5 5 10 20 40
011 0.6875 1.375 2.75 5.5 11 22 44
100 0.75 1.5 3 6 12 24 48
101 0.8125 1.625 3.25 6.5 13 26 52
110 0.875 1.75 3.5 7 14 28 56
111 0.9375 1.875 3.75 7.5 15 30 60
Note that as we move towards larger values, the gap between representable values gets larger. We can represent 8 values between 0.5 and 1.0, with a gap of 0.0625 between each. We can represent 8 values between 1.0 and 2.0, with a gap of 0.125 between each. We can represent 8 values between 2.0 and 4.0, with a gap of 0.25 in between each. And so on. Note that we can represent all the positive integers up to 16, but we cannot represent the value 17 in this format; we simply don't have enough bits in the significand to do so. If we add the values 8 and 9 in this format, we'll get 16 as a result, which is a rounding error. If that result is used in any other computation, that rounding error will be compounded.
Note that some values cannot be represented exactly no matter how many bits you have in the significand. Just like 1/3 gives us the non-terminating decimal fraction 0.333333..., 1/10 gives us the non-terminating binary fraction 1.10011001100.... We would need an infinite number of bits in the significand to represent that value.

a double on a 64 bit machine, has one sign bit, 11 exponent bits and 52 fractional bits.
think (1 sign bit) * (52 fractional bits) ^ (11 exponent bits)

Related

float %.2f round but double %.2lf not rounding and how to bypass it? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Floating point comparison
I have a problem about the accuracy of float in C/C++. When I execute the program below:
#include <stdio.h>
int main (void) {
float a = 101.1;
double b = 101.1;
printf ("a: %f\n", a);
printf ("b: %lf\n", b);
return 0;
}
Result:
a: 101.099998
b: 101.100000
I believe float should have 32-bit so should be enough to store 101.1 Why?
You can only represent numbers exactly in IEEE754 (at least for the single and double precision binary formats) if they can be constructed from adding together inverted powers of two (i.e., 2-n like 1, 1/2, 1/4, 1/65536 and so on) subject to the number of bits available for precision.
There is no combination of inverted powers of two that will get you exactly to 101.1, within the scaling provided by floats (23 bits of precision) or doubles (52 bits of precision).
If you want a quick tutorial on how this inverted-power-of-two stuff works, see this answer.
Applying the knowledge from that answer to your 101.1 number (as a single precision float):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm 1/n
0 10000101 10010100011001100110011
| | | || || || |+- 8388608
| | | || || || +-- 4194304
| | | || || |+----- 524288
| | | || || +------ 262144
| | | || |+--------- 32768
| | | || +---------- 16384
| | | |+------------- 2048
| | | +-------------- 1024
| | +------------------ 64
| +-------------------- 16
+----------------------- 2
The mantissa part of that actually continues forever for 101.1:
mmmmmmmmm mmmm mmmm mmmm mm
100101000 1100 1100 1100 11|00 1100 (and so on).
hence it's not a matter of precision, no amount of finite bits will represent that number exactly in IEEE754 format.
Using the bits to calculate the actual number (closest approximation), the sign is positive. The exponent is 128+4+1 = 133 - 127 bias = 6, so the multiplier is 26 or 64.
The mantissa consists of 1 (the implicit base) plus (for all those bits with each being worth 1/(2n) as n starts at 1 and increases to the right), {1/2, 1/16, 1/64, 1/1024, 1/2048, 1/16384, 1/32768, 1/262144, 1/524288, 1/4194304, 1/8388608}.
When you add all these up, you get 1.57968747615814208984375.
When you multiply that by the multiplier previously calculated, 64, you get 101.09999847412109375.
All numbers were calculated with bc using a scale of 100 decimal digits, resulting in a lot of trailing zeros, so the numbers should be very accurate. Doubly so, since I checked the result with:
#include <stdio.h>
int main (void) {
float f = 101.1f;
printf ("%.50f\n", f);
return 0;
}
which also gave me 101.09999847412109375000....
You need to read more about how floating-point numbers work, especially the part on representable numbers.
You're not giving much of an explanation as to why you think that "32 bits should be enough for 101.1", so it's kind of hard to refute.
Binary floating-point numbers don't work well for all decimal numbers, since they basically store the number in, wait for it, base 2. As in binary.
This is a well-known fact, and it's the reason why e.g. money should never be handled in floating-point.
Your number 101.1 in base 10 is 1100101.0(0011) in base 2. The 0011 part is repeating. Thus, no matter how many digits you'll have, the number cannot be represented exactly in the computer.
Looking at the IEE754 standard for floating points, you can find out why the double version seemed to show it entirely.
PS: Derivation of 101.1 in base 10 is 1100101.0(0011) in base 2:
101 = 64 + 32 + 4 + 1
101 -> 1100101
.1 * 2 = .2 -> 0
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2....
PPS: It's the same if you'd wanted to store exactly the result of 1/3 in base 10.
What you see here is the combination of two factors:
IEEE754 floating point representation is not capable of accurately representing a whole class of rational and all irrational numbers
The effects of rounding (by default here to 6 decimal places) in printf. That is say that the error when using a double occurs somewhere to the right of the 6th DP.
If you had more digits to the print of the double you'll see that even double cannot be represented exactly:
printf ("b: %.16f\n", b);
b: 101.0999999999999943
The thing is float and double are using binary format and not all floating pointer numbers can be represented exactly with binary format.
Unfortunately, most decimal floating point numbers cannot be accurately represented in (machine) floating point. This is just how things work.
For instance, the number 101.1 in binary will be represented like 1100101.0(0011) ( the 0011 part will be repeated forever), so no matter how many bytes you have to store it, it will never become accurate. Here is a little article about binary representation of floating point, and here you can find some examples of converting floating point numbers to binary.
If you want to learn more on this subject, I could recommend you this article, though it's long and not too easy to read.

Why floating-points number's significant numbers is 7 or 6

I see this in Wikipedia log 224 = 7.22.
I have no idea why we should calculate 2^24 and why we should take log10......I really really need your help.
why floating-points number's significant numbers is 7 or 6 (?)
Consider some thoughts employing the Pigeonhole principle:
binary32 float can encode about 232 different numbers exactly. The numbers one can write in text like 42.0, 1.0, 3.1415623... are infinite, even if we restrict ourselves to a range like -1038 ... +1038. Any time code has a textual value like 0.1f, it is encoded to a nearby float, which may not be the exact same text value. The question is: how many digits can we code and still maintain distinctive float?
For the various powers-of-2 range, 223 (8,388,608) values are normally linearly encoded.
Example: In the range [1.0 ... 2.0), 223 (8,388,608) values are linearly encoded.
In the range [233 or 8,589,934,592 ... 234 or 17,179,869,184), again, 223 (8,388,608) values are linearly encoded: 1024.0 apart from each other. In the sub range [9,000,000,000 and 10,000,000,000), there are about 976,562 different values.
Put this together ...
As text, the range [1.000_000 ... 2.000_000), using 1 lead digit and 6 trailing ones, there are 1,000,000 different values. Per #3, In the same range, with 8,388,608 different float exist, allowing each textual value to map to a different float. In this range we can use 7 digits.
As text, the range [9,000,000 × 103 and 10,000,000 × 103), using 1 lead digit and 6 trailing ones, there are 1,000,000 different values. Per #4, In the same range, there are less than 1,000,000 different float values. Thus some decimal textual values will convert to the same float. In this range we can use 6, not 7, digits for distinctive conversions.
The worse case for typical float is 6 significant digits. To find the limit for your float:
#include <float.h>
printf("FLT_DIG = %d\n", FLT_DIG); // this commonly prints 6
... no idea why we should calculate 2^24 and why we should take log10
224 is a generalization as with common float and its 24 bits of binary precision, that corresponds to fanciful decimal system with 7.22... digits. We take log10 to compare the binary float to decimal text.
224 == 107.22...
Yet we should not take 224. Let us look into how FLT_DIG is defined from C11dr §5.2.4.2.2 11:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
p log10 b ............. if b is a power of 10
⎣(p − 1) log10 _b_⎦.. otherwise
Notice "log10 224" is same as "24 log10 2".
As a float, the values are distributed linearly between powers of 2 as shown in #2,3,4.
As text, values are distributed linearly between powers of 10 like a 7 significant digit values of [1.000000 ... 9.999999]*10some_exponent.
The transition of these 2 groups happen at different values. 1,2,4,8,16,32... versus 1,10,100, ... In determining the worst case, we subtract 1 from the 24 bits to account for the mis-alignment.
⎣(p − 1) log10 _b_⎦ --> floor((24 − 1) log10(2)) --> floor(6.923...) --> 6.
Had our float used base 10, 100, or 1000, rather than very common 2, the transition of these 2 groups happen at same values and we would not subtract one.
An IEEE 754 single-precision float has a 24-bit mantissa. This means it has 24 binary bits' worth of precision.
But we might be interested in knowing how many decimal digits worth of precision it has.
One way of computing this is to consider how many 24-bit binary numbers there are. The answer, of course, is 224. So these binary numbers go from 0 to 16777215.
How many decimal digits is that? Well, log10 gives you the number of decimal digits. log10(224) is 7.2, or a little more than 7 decimal digits.
And look at that: 16777215 has 8 digits, but the leading digit is just 1, so in fact it's only a little more than 7 digits.
(Of course this doesn't mean we can represent only numbers from 0 to 16777215! It means we can represent numbers from 0 to 16777215 exactly. But we've also got the exponent to play with. We can represent numbers from 0 to 1677721.5 more or less exactly to one place past the decimal, numbers from 0 to 167772.15 more or less exactly to two decimal points, etc. And we can represent numbers from 0 to 167772150, or 0 to 1677721500, but progressively less exactly -- always with ~7 digits' worth of precision, meaning that we start losing precision in the low-order digits to the left of the decimal point.)
The other way of doing this is to note that log10(2) is about 0.3. This means that 1 bit corresponds to about 0.3 decimal digits. So 24 bits corresponds to 24 × 0.3 = 7.2.
(Actually, IEEE 754 single-precision floating point explicitly stores only 23 bits, not 24. But there's an implicit leading 1 bit in there, so we do get the effect of 24 bits.)
Let's start a little smaller. With 10 bits (or 10 base-2 digits), you can represent the numbers 0 upto 1023. So you can represent up to 4 digits for some values, but 3 digits for most others (the ones below 1000).
To find out how many base-10 (decimal) digits can be represented by a bunch of base-2 digits (bits), you can use the log10() of the maximum representable value, i.e. log10(2^10) = log10(2) * 10 = 3.01....
The above means you can represent all 3 digit — or smaller — values and a few 4 digits ones. Well, that is easily verified: 0-999 have at most 3 digits, and 1000-1023 have 4.
Now take 24 bits. In 24 bits you can store log10(2^24) = 24 * log(2) base-10 digits. But because the top bit is always the same, you can in fact only store log10(2^23) = log10(8388608) = 6.92. This means you can represent most 7 digits numbers, but not all. Some of the numbers you can represent faithfully can only have 6 digits.
The truth is a bit more complicated though, because exponents play role too, and some of the many possible larger values can be represented too, so 6.92 may not be the exact value. But it gets close, and can nicely serve as a rule of thumb, and that is why they say that single precision can represent 6 to 7 digits.

Floating point conversion - Binary -> decimal

Here's the number I'm working on
1 01110 001 = ____
1 sign bit, 5 exp bits, 3 fraction bits
bias = 15
Here's my current process, hopefully you can tell me where I'm missing something
Convert binary exponent to decimal
01110 = 14
Subtract bias
14 - 15 = -1
Multiply fraction bits by result
0.001 * 2^-1 = 0.0001
Convert to decimal
.0001 = 1/16
The sign bit is 1 so my result is -1/16, however the given answer is -9/16. Would anyone mind explaining where the extra 8 in the fraction is coming from?
You seem to have the correct concept, including an understanding of the excess-N representation, but you're missing a crucial point.
The 3 bits used to encode the fractional part of the magnitude are 001, but there is an implicit 1. preceding the fraction bits, so the full magnitude is actually 1.001, which can be represented as an improper fraction as 1+1/8 => 9/8.
2^(-1) is the same as 1/(2^1), or 1/2.
9/8 * 1/2 = 9/16. Take the sign bit into account, and you arrive at the answer -9/16.
For normalized floating point representation, the Mantissa (fractional bits) = 1 + f. This is sometimes called an implied leading 1 representation. This is a trick for getting an additional bit of precision for free since we can always adjust the exponent E so that significant M is in the range 1<=M < 2 ...
You are almost correct but must take into consideration the implied 1. If it is denormalized (meaning the exponent bits are all 0s) you do not add an implied 1.
I would solve this problem as such...
1 01110 001
bias = 2^(k-1) -1 = 14
Exponent = e - bias
14 - 15 = -1
Take the fractional bits ->> 001
Add the implied 1 ->> 1.001
Shift it by the exponent, which is -1. Becomes .1001
Count up the values, 1(1/2) + 0(1/4) + 0(1/8) + 1(1/16) = 9/16
With the a negative sign bit it becomes -9/16
hope that helps!

Problematic understanding of IEEE 754 [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
First of all i woild like to point out that i am not native speaker and i really need some terms used more commonly.
And the second thing i would like to mention is that i am not a math genious. I am really trying to understand everything about programming.. but ieee-754 makes me think that it'll never happan.. its full of mathematical terms i don't understand..
What is precision? What is it used for? What is mantissa and what is mantissa used for? How to determine the range of float/double by their size? What is ± symbol (Plus-minus) used for? (i believe its positive/negative choice but what does that have to do with everything?),
Isn't there any brief and clean explanation you guys could provide me with?
I spent 600 years of trying to understand wikipedia. I failed tremendously.
What is precision?
It refers to how closely a binary floating point representation can represent a real value. Real values have infinite precision and infinite range. Digital values have finite range and precision. In practice a single-precision IEEE-754 can represent real values of a precision of 6 significant figures (decimal), while double-precision is good for 15 significant figures.
The practical effect of this for example is that a single precision value: 123456000.00 cannot be distinguished from say 123456001.00, but equally a value 0.00123456 can be represented.
What is it used for?
Precision is not used for anything other than to define a characteristic of a particular floating point representation.
What is mantissa and what is mantissa used for?
The term is not mentioned in the English language Wikipedia article, and is imprecise - in mathematics in general it has a different meaning that that used here.
The correct term is significand. For a decimal value 0.00123456 for example the significand is is 123456. 123456000.00 has exactly the same significand. Each of these values has the same significand but a different exponent. The exponent is a scaling factor which determines where the decimal point is (hence floating point).
Of course IEEE754 is a binary floating point representation not decimal, but for the same of explanation of the terms it is perhaps easier to use decimal.
How to determine the range of float/double by their size?
By the size alone you cannot; you need to know how many bits are assigned to the significand and how many bits are assigned to the exponent. In C however the range is defined by the macros FLT_MIN, FLT_MAX, DBL_MIN and DBL_MAX in the float.h header. Other characteristics of the implementations floating point representation are described there also.
Note that a specific compiler may not in fact use IEEE754, however that is the format used by most hardware FPU implementations, and the compiler will naturally follow that. For targets with no FPU (small embedded processors typically), other formats may be used.
What is ± symbol (Plus-minus) used for?
It simply means that the value given may be both positive or negative. It may refer to a specific value, or it may indicate a range. So ±n may refer to two discrete values -n or +n, or it may mean a range -n to +n. Context is everything! In this article it refers to discrete values +0, -0, +∞ and -∞.
There are 3 different components: sign, exponent, mantissa
Assuming that the exponent has only 2 Bits, 4 combinations are possible:
binary decimal
00 0
01 1
10 2
11 3
The represented floating-point value is 2exponent:
binary exponent-value
00 2^0 = 1
01 2^1 = 2
10 2^2 = 4
11 2^3 = 8
The range of the floating point value, results from the exponent. 2 bits => maximum value = 8.
The mantissa divide the range from a given exponent to the next higher exponent.
For example the exponent is 2 and the mantissa has one bit, then there are two values possible:
exponent-value mantissa-binary represented floating-point value
2 0 2
2 1 3
The represented floating-point value is 2exponent × (1 + m1×2-1 + m2×2-2 + m3×2-3 + …).
Here an example with a 3 bit mantissa:
exponent-value mantissa-binary represented floating-point value
2 000 2 * (1 ) = 2
2 001 2 * (1 + 2^-3) = 2,25
2 010 2 * (1 + 2^-2 ) = 2,5
2 011 2 * (1 + 2^-2 + 2^-3) = 2,75
2 100 2 * (1 + 2^-1 ) = 3
and so on…
The sign has only just one Bit:
0 -> positive value
1 -> negative value
In IEEE-754 a 32 bit floating-point data type has an 8 bit exponent (with a range from 2-127 to 2128) and a 23 bit mantissa.
1 10000010 01101000000000000000000
- 130 1,40625
The represented floating-point value for this is:
-1 × 2(130 – 127) × (1 + 2-2 + 2-3 + 2-5) = -11,25
Try it: http://www.h-schmidt.net/FloatConverter/IEEE754.html

SqlServer: what means precision 53?

For example FLOAT type have default numeric precision 53
SELECT * FROM sys.types
and TDS protocol sends precision 53, but through OLEDB precision returns as 18.
What exactly precision 53 means, it seems impossible to put as many decimal digits into 8 bytes? How OLEDB gets 18 from that number? I found MaxLenFromPrecision in Microsoft sources but this array up to precision 37 for numeric types only. FreeTDS have array up to 77 tds_numeric_bytes_per_prec but isn't similar with Microsoft's data and item by index 53 not 18 as OLEDB returns. By that is 53 is some virtual number while 18 real precision as described at http://msdn.microsoft.com/en-us/library/ms190476.aspx ?
What exactly precision 53 means, it seems impossible to put as many decimal digits into 8 bytes?
Precision 53 is 53 bits. 8 bytes contain 64 bits.
A double precision floating point number has 1 sign bit, 11 exponent bits, and 53 mantissa bits.
From BOL: http://msdn.microsoft.com/en-us/library/ms173773.aspx
Where n is the number of bits that are used to store the mantissa of
the float number in scientific notation and, therefore, dictates the
precision and storage size. If n is specified, it must be a value
between 1 and 53. The default value of n is 53.
53 is the precision set at the database level if you don't explicitly set it when declaring your float data type.

Resources