Floating point operation in C

We know that in C, the single-precision floating point range is from 1.xxxx * 10^-38 to 3.xxxx * 10^38.
On my lecture slides there is this operation:
(10^10 + 10^30) + (-10^30) ?= 10^10 + (10^30 + -10^30)
10^30 - 10^30 ?= 10^10 + 0
I'm wondering why 10^10 + 10^30 = 10^30 in this case?
What I thought is, since the range of FP can go down to 10^-38 and up to 10^38, there shouldn't be an overflow, so 10^10 + 10^30 shouldn't end up being 10^30.

As said in the comments on your question, the part which stores the digits is finite. It is referred to as the significand.
Consider the following simple 14-bit format:
[sign bit] [5-bit exponent] [8-bit significand]
let the bias be 16, i.e. if the stored exponent is 16 the actual exponent is 0 (so we get a good range of +/- powers),
and no implied bits.
So if we have numbers more than a factor of 2^8 apart, like 2048 and 0.5,
in our format:
2048 = 2^11 = [0][11011][1000 0000]
0.5 = 2^-1 = [0][01111][1000 0000]
when we add these numbers we align the exponents so that the digits line up. A decimal analogy is:
5 x 10^3 + 5 x 10^-2 => 5 x 10^3 + 0.00005 x 10^3
but the significand can't hold 12 places:
2^11 + 0.000000000001 x 2^11 = 1.000000000001 x 2^11
so it rounds back to 2^11

The essence is the notion of significant digits. It's roughly 7 decimal digits for IEEE754 float. If we use hypothetical decimal floating point numbers with 7 significant digits, the calculation is done in this way:
10^10 + 10^30 == 1.000 000 * 10^10 + 1.000 000 * 10^30
== (0.000 000 000 000 000 000 01 + 1.000 000) * 10^30 (match the exponent part)
=> (0.000 000 + 1.000 000) * 10^30 (round the left operand)
== 1.000 000 * 10^30
== 10^30
Note however that the matching operation and the rounding operation are done as a single step, ie. the machine can never deal with 0.000 000 000 000 000 000 01 * 10^30 which has too many significant digits.
By the way, if you conduct experiments on floating point arithmetic in C, you may find the %a format specifier useful (introduced in C99). But note that printf always implicitly converts float arguments to double.
#include <stdio.h>

int main(void) {
    float x = 1e10f, y = 1e30f;
    printf("(%a + %a) == %a == %a\n", x, y, x + y, y);
    return 0;
}
http://ideone.com/WeXe22


float %.2f round but double %.2lf not rounding and how to bypass it?

I have a problem with the accuracy of float in C/C++. When I execute the program below:
#include <stdio.h>

int main(void) {
    float a = 101.1;
    double b = 101.1;
    printf("a: %f\n", a);
    printf("b: %lf\n", b);
    return 0;
}
Result:
a: 101.099998
b: 101.100000
I believe a float has 32 bits, which should be enough to store 101.1. Why is this happening?
You can only represent numbers exactly in IEEE754 (at least for the single and double precision binary formats) if they can be constructed from adding together inverted powers of two (i.e., 2^-n, like 1, 1/2, 1/4, 1/65536 and so on), subject to the number of bits available for precision.
There is no combination of inverted powers of two that will get you exactly to 101.1, within the scaling provided by floats (23 bits of precision) or doubles (52 bits of precision).
If you want a quick tutorial on how this inverted-power-of-two stuff works, see this answer.
Applying the knowledge from that answer to your 101.1 number (as a single precision float):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm 1/n
0 10000101 10010100011001100110011
| | | || || || |+- 8388608
| | | || || || +-- 4194304
| | | || || |+----- 524288
| | | || || +------ 262144
| | | || |+--------- 32768
| | | || +---------- 16384
| | | |+------------- 2048
| | | +-------------- 1024
| | +------------------ 64
| +-------------------- 16
+----------------------- 2
The mantissa part of that actually continues forever for 101.1:
mmmmmmmmm mmmm mmmm mmmm mm
100101000 1100 1100 1100 11|00 1100 (and so on).
hence it's not a matter of precision: no finite number of bits will represent that number exactly in IEEE754 format.
Using the bits to calculate the actual number (closest approximation), the sign is positive. The exponent is 128+4+1 = 133 - 127 bias = 6, so the multiplier is 2^6, or 64.
The mantissa consists of 1 (the implicit base) plus (for all those bits, with each being worth 1/2^n as n starts at 1 and increases to the right) {1/2, 1/16, 1/64, 1/1024, 1/2048, 1/16384, 1/32768, 1/262144, 1/524288, 1/4194304, 1/8388608}.
When you add all these up, you get 1.57968747615814208984375.
When you multiply that by the multiplier previously calculated, 64, you get 101.09999847412109375.
All numbers were calculated with bc using a scale of 100 decimal digits, resulting in a lot of trailing zeros, so the numbers should be very accurate. Doubly so, since I checked the result with:
#include <stdio.h>

int main(void) {
    float f = 101.1f;
    printf("%.50f\n", f);
    return 0;
}
which also gave me 101.09999847412109375000....
You need to read more about how floating-point numbers work, especially the part on representable numbers.
You're not giving much of an explanation as to why you think that "32 bits should be enough for 101.1", so it's kind of hard to refute.
Binary floating-point numbers don't work well for all decimal numbers, since they basically store the number in, wait for it, base 2. As in binary.
This is a well-known fact, and it's the reason why e.g. money should never be handled in floating-point.
Your number 101.1 in base 10 is 1100101.0(0011) in base 2. The 0011 part is repeating. Thus, no matter how many digits you'll have, the number cannot be represented exactly in the computer.
Looking at the IEEE754 standard for floating point, you can find out why the double version seemed to show it entirely.
PS: Derivation that 101.1 in base 10 is 1100101.0(0011) in base 2:
101 = 64 + 32 + 4 + 1
101 -> 1100101
.1 * 2 = .2 -> 0
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2....
PPS: It's the same as if you wanted to store the exact result of 1/3 in base 10.
What you see here is the combination of two factors:
IEEE754 floating point representation is not capable of accurately representing a whole class of rational and all irrational numbers
The effects of rounding (by default here to 6 decimal places) in printf. That is to say, the error when using a double occurs somewhere to the right of the 6th decimal place.
If you print the double with more digits, you'll see that even a double cannot represent it exactly:
printf ("b: %.16f\n", b);
b: 101.0999999999999943
The thing is, float and double use a binary format, and not all floating point numbers can be represented exactly in binary.
Unfortunately, most decimal floating point numbers cannot be accurately represented in (machine) floating point. This is just how things work.
For instance, the number 101.1 in binary will be represented like 1100101.0(0011) ( the 0011 part will be repeated forever), so no matter how many bytes you have to store it, it will never become accurate. Here is a little article about binary representation of floating point, and here you can find some examples of converting floating point numbers to binary.
If you want to learn more on this subject, I could recommend you this article, though it's long and not too easy to read.

Convert a continuing binary fraction to a decimal fraction in C

I implemented a digit-by-digit calculation of the square root of two. Each round it outputs one bit of the fractional part, e.g.
1 0 1 1 0 1 0 1 etc.
I want to convert this output to decimal numbers:
4 1 4 2 1 3 6 etc.
The issue I'm facing is that this would generally work like this:
1 * 2^-1 + 0 * 2^-2 + 1 * 2^-3 etc.
I would like to avoid fractions altogether, as I would like to work with integers to convert from binary to decimal. Also I would like to print each decimal digit as soon as it has been computed.
Converting to hex is trivial, as I only have to wait for 4 bits. Is there a smart approach to convert to base 10 which allows observing only a part of the whole output and ideally removing digits from the equation once we are certain that they won't change anymore, i.e.
1 0
2 0,25
3 0,375
4 0,375
5 0,40625
6 0,40625
7 0,4140625
8 0,4140625
After processing the 8th bit, I'm pretty sure that 4 is the first decimal fraction digit. Therefore I would like to remove 0.4 completely from the equation to reduce the number of bits I need to take care of.
Is there a smart approach to convert to base 10 which allows to observe only a part of the whole output and ideally remove digits from the equation, once we are certain that they won't change anymore (?)
Yes, eventually in practice, but in theory, no in select cases.
This is akin to the Table-maker's dilemma.
Consider the below handling of a value near 0.05. As long as the binary sequence is .0001 1001 1001 1001 1001 ..., we cannot know whether the decimal equivalent is 0.04999999... or 0.05000000...non-zero.
#include <math.h>
#include <stdio.h>

int main(void) {
    double a;
    a = nextafter(0.05, 0);
    printf("%20a %.20f\n", a, a);
    a = 0.05;
    printf("%20a %.20f\n", a, a);
    a = nextafter(0.05, 1);
    printf("%20a %.20f\n", a, a);
    return 0;
}
0x1.9999999999999p-5 0.04999999999999999584
0x1.999999999999ap-5 0.05000000000000000278
0x1.999999999999bp-5 0.05000000000000000971
Code can analyse the incoming sequence of binary fraction bits and then ask two questions after each bit: "if the remaining bits are all 0, what is it in decimal?" and "if the remaining bits are all 1, what is it in decimal?". In many cases, the answers will share common leading significant digits. Yet as shown above, as long as the repeating 1001 is received, there are no common significant decimal digits.
A usual "out" is to have an upper bound on the number of decimal digits that will ever be shown. In that case the code is only presenting a rounded result, and that can be deduced in finite time even if the binary input sequence remains 1001 ad nauseam.
The issue I'm facing is that this would generally work like this:
1 * 2^-1 + 0 * 2^-2 + 1 * 2^-3 etc.
Well, 1/2 = 5/10 and 1/4 = 25/100 and so on, which means you will need powers of 5 and to shift the values by powers of 10,
so given 0 1 1 0 1
[1] 0 * 5 = 0
[2] 0 * 10 + 1 * 25 = 25
[3] 25 * 10 + 1 * 125 = 375
[4] 375 * 10 + 0 * 625 = 3750
[5] 3750 * 10 + 1 * 3125 = 40625
Edit:
Is there a smart approach to convert to base 10 which allows to observe only a part of the whole output and ideally remove digits from the equation, once we are certain that they won't change anymore
It might actually be possible to pop the most significant digit (MSD) in this case. This will be a bit long, but please bear with me.
Consider the values X and Y:
If X has the same number of digits as Y, then the MSD will change.
10000 + 10000 = 20000
If Y has 1 or more digits less than X, then the MSD can change.
19000 + 1000 = 20000
19900 + 100 = 20000
So the first point is self-explanatory, but the second point is what will allow us to pop the MSD. The first thing we need to know is that the values we are adding are continuously being halved every iteration. This means that if we only consider the MSD, the largest value in base 10 is 9, which will produce the sequence
9 > 4 > 2 > 1 > 0
If we sum up these values it equals 16, but if we try to consider the values of the next digits (e.g. 9.9 or 9.999), the sum approaches 20 without ever exceeding it. What this means is that if X has n digits and Y has n-1 digits, the MSD of X can still change. But if X has n digits and Y has n-2 digits, then as long as the (n-1)th digit of X is less than 8, the MSD will not change (otherwise it could be 8 + 2 = 10 or 9 + 2 = 11, which means that the MSD would change). Here are some examples.
Assuming X is the running sum of sqrt(2) and Y is 5^n:
1. If X = 10000 and Y = 9000 then the MSD of X can change.
2. If X = 10000 and Y = 900 then the MSD of X will not change.
3. If X = 19000 and Y = 900 then the MSD of X can change.
4. If X = 18000 and Y = 999 then the MSD of X can change.
5. If X = 17999 and Y = 999 then the MSD of X will not change.
6. If X = 19990 and Y = 9 then the MSD of X can change.
In the example above, on points #2 and #5, the 1 can already be popped. However for point #6, it is possible to have 19990 + 9 + 4 = 20003, but this also means that both the 2 and the 0 can be popped after that happens.
Here's a simulation for sqrt(2)
i Out X Y flag
-------------------------------------------------------------------
1 0 5 0
2 25 25 1
3 375 125 1
4 3,750 625 0
5 40,625 3,125 1
6 406,250 15,625 0
7 4 140,625 78,125 1
8 4 1,406,250 390,625 0
9 4 14,062,500 1,953,125 0
10 41 40,625,000 9,765,625 0
11 41 406,250,000 48,828,125 0
12 41 4,062,500,000 244,140,625 0
13 41 41,845,703,125 1,220,703,125 1
14 414 18,457,031,250 6,103,515,625 0
15 414 184,570,312,500 30,517,578,125 0
16 414 1,998,291,015,625 152,587,890,625 1
17 4142 0,745,849,609,375 762,939,453,125 1
You can use a multiply-and-divide approach to reduce the floating point arithmetic.
1 0 1 1
which is equivalent to 1*2^-1 + 0*2^-2 + 1*2^-3 + 1*2^-4 and can be simplified to (1*2^3 + 0*2^2 + 1*2^1 + 1*2^0)/2^4. Only the final division remains a floating point operation; all the rest is integer arithmetic. Multiplication by 2 can be implemented with a left shift.

Floating point conversion - Binary -> decimal

Here's the number I'm working on
1 01110 001 = ____
1 sign bit, 5 exp bits, 3 fraction bits
bias = 15
Here's my current process, hopefully you can tell me where I'm missing something
Convert binary exponent to decimal
01110 = 14
Subtract bias
14 - 15 = -1
Multiply fraction bits by result
0.001 * 2^-1 = 0.0001
Convert to decimal
.0001 = 1/16
The sign bit is 1 so my result is -1/16, however the given answer is -9/16. Would anyone mind explaining where the extra 8 in the fraction is coming from?
You seem to have the correct concept, including an understanding of the excess-N representation, but you're missing a crucial point.
The 3 bits used to encode the fractional part of the magnitude are 001, but there is an implicit 1. preceding the fraction bits, so the full magnitude is actually 1.001, which as an improper fraction is 1 + 1/8 => 9/8.
2^(-1) is the same as 1/(2^1), or 1/2.
9/8 * 1/2 = 9/16. Take the sign bit into account, and you arrive at the answer -9/16.
For normalized floating point representation, the mantissa (fraction bits) M = 1 + f. This is sometimes called an implied leading 1 representation. It is a trick for getting an additional bit of precision for free, since we can always adjust the exponent E so that the significand M is in the range 1 <= M < 2 ...
You are almost correct but must take into consideration the implied 1. If it is denormalized (meaning the exponent bits are all 0s) you do not add an implied 1.
I would solve this problem as such...
1 01110 001
bias = 2^(k-1) - 1 = 2^4 - 1 = 15
Exponent = e - bias
14 - 15 = -1
Take the fractional bits ->> 001
Add the implied 1 ->> 1.001
Shift it by the exponent, which is -1. Becomes .1001
Count up the values, 1(1/2) + 0(1/4) + 0(1/8) + 1(1/16) = 9/16
With the negative sign bit it becomes -9/16
hope that helps!

Number of bits assigned for double data type

How many bits out of the 64 are assigned to the integer part and the fractional part in a double? Or is there any rule that specifies it?
Note: I know I already replied with a comment. This is for my own benefit as much as the OPs; I always learn something new when I try to explain it.
Floating-point values (regardless of precision) are represented as follows:
sign * significand * β^exp
where sign is 1 or -1, β is the base, exp is an integer exponent, and significand is a fraction. In this case, β is 2. For example, the real value 3.0 can be represented as 1.10₂ * 2^1, or 0.11₂ * 2^2, or even 0.011₂ * 2^3.
Remember that a binary number is a sum of powers of 2, with powers decreasing from the left. For example, 101₂ is equivalent to 1 * 2^2 + 0 * 2^1 + 1 * 2^0, which gives us the value 5. You can extend that past the radix point by using negative powers of 2, so 101.11₂ is equivalent to
1 * 2^2 + 0 * 2^1 + 1 * 2^0 + 1 * 2^-1 + 1 * 2^-2
which gives us the decimal value 5.75. A floating-point number is normalized such that there's a single non-zero digit prior to the radix point, so instead of writing 5.75 as 101.11₂, we'd write it as 1.0111₂ * 2^2
How is this encoded in a 32-bit or 64-bit binary format? The exact format depends on the platform; most modern platforms use the IEEE-754 specification (which also specifies the algorithms for floating-point arithmetic, as well as special values such as infinity and Not A Number (NaN)), however some older platforms may use their own proprietary format (such as the VAX G and H extended-precision floats). x86 also has a proprietary 80-bit format for intermediate calculations.
The general layout looks something like the following:
seeeeeeee...ffffffff....
where s represents the sign bit, e represents bits devoted to the exponent, and f represents bits devoted to the significand or fraction. The IEEE-754 32-bit single-precision layout is
seeeeeeeefffffffffffffffffffffff
This gives us an 8-bit exponent (which can represent the values -126 through 127) and a 23-bit significand (giving us roughly 6 to 7 significant decimal digits). A 0 in the sign bit represents a positive value, 1 represents negative. The exponent is encoded such that 00000001₂ represents -126, 01111111₂ represents 0, and 11111110₂ represents 127 (00000000₂ is reserved for representing 0 and "denormalized" numbers, while 11111111₂ is reserved for representing infinity and NaN). This format also assumes a hidden leading fraction bit that's always set to 1. Thus, our value 5.75, which we represent as 1.0111₂ * 2^2, would be encoded in a 32-bit single-precision float as
01000000101110000000000000000000
|| || |
|| |+----------+----------+
|| | |
|+--+---+ +------------ significand (1.0111, hidden leading bit)
| |
| +---------------------------- exponent (2)
+-------------------------------- sign (0, positive)
The IEEE-754 double-precision float uses 11 bits for the exponent (-1022 through 1023) and 52 bits for the significand. I'm not going to bother writing that out (this post is turning into a novel as it is).
Floating-point numbers have a greater range than integers because of the exponent; the exponent 127 only takes 8 bits to encode, but 2^127 represents a 38-digit decimal number. The more bits in the exponent, the greater the range of values that can be represented. The precision (the number of significant digits) is determined by the number of bits in the significand. The more bits in the significand, the more significant digits you can represent.
Most real values cannot be represented exactly as a floating-point number; you cannot squeeze an infinite number of values into a finite number of bits. Thus, there are gaps between representable floating point values, and most values will be approximations. To illustrate the problem, let's look at an 8-bit "quarter-precision" format:
seeeefff
This gives us an exponent between -7 and 8 (we're not going to worry about special values like infinity and NaN) and a 3-bit significand with a hidden leading bit. The larger our exponent gets, the wider the gap between representable values gets. Here's a table showing the issue. The left column is the significand; each additional column shows the values we can represent for the given exponent:
sig -1 0 1 2 3 4 5
--- ---- ----- ----- ----- ----- ----- ----
000 0.5 1 2 4 8 16 32
001 0.5625 1.125 2.25 4.5 9 18 36
010 0.625 1.25 2.5 5 10 20 40
011 0.6875 1.375 2.75 5.5 11 22 44
100 0.75 1.5 3 6 12 24 48
101 0.8125 1.625 3.25 6.5 13 26 52
110 0.875 1.75 3.5 7 14 28 56
111 0.9375 1.875 3.75 7.5 15 30 60
Note that as we move towards larger values, the gap between representable values gets larger. We can represent 8 values between 0.5 and 1.0, with a gap of 0.0625 between each. We can represent 8 values between 1.0 and 2.0, with a gap of 0.125 between each. We can represent 8 values between 2.0 and 4.0, with a gap of 0.25 in between each. And so on. Note that we can represent all the positive integers up to 16, but we cannot represent the value 17 in this format; we simply don't have enough bits in the significand to do so. If we add the values 8 and 9 in this format, we'll get 16 as a result, which is a rounding error. If that result is used in any other computation, that rounding error will be compounded.
Note that some values cannot be represented exactly no matter how many bits you have in the significand. Just like 1/3 gives us the non-terminating decimal fraction 0.333333..., 1/10 gives us the non-terminating binary fraction 0.000110011001100.... We would need an infinite number of bits in the significand to represent that value.
A double on a 64-bit machine has one sign bit, 11 exponent bits and 52 fraction bits.
Think (sign) * (1 + 52-bit fraction) * 2^(11-bit biased exponent).

Convert between formats with different number of bits for exponent and fractional part

I am trying to refresh on floats. I am reading an exercise that asks to convert from format A (k=3 exponent bits, 4 fraction bits, bias 3) to format B (k=4 exponent bits, 3 fraction bits, bias 7).
We should round when necessary.
Example between formats:
011 0000 (Value = 1) =====> 0111 000 (Value = 1)
010 1001 (Value = 25/32) =====> 0110 100 (Value = 3/4 Rounded down)
110 1111 (Value = 31/2) =====> 1011 000 (Value = 16 Rounded up)
Problem: I cannot figure out how the conversion works. I managed to do it correctly in some cases, but my approach was to convert the bit pattern of format A to the decimal value and then to express that value in the bit pattern of format B.
But is there a way to somehow go from one bit pattern to the other without doing this conversion, just knowing that we extend the exponent by 1 bit and reduce the fraction by 1 bit?
But is there a way to somehow go from one bit pattern to the other without doing this conversion, just knowing that we extend the e by 1 bit and reduce the fraction by 1?
Yes, and this is much simpler than going through the decimal value (which is only correct if you convert to the exact decimal value and not an approximation).
011 0000 (Value = 1)
represents 1.0000 * 2^(3-3)
is really 1.0 * 2^0 in “natural” binary
represents 1.000 * 2^(7-7) to pre-format for the destination format
=====> 0111 000 (Value = 1)
Second example:
010 1001 (Value = 25/32)
represents 1.1001 * 2^(2-3)
is really 1.1001 * 2^-1
rounds to 1.100 * 2^-1 when we suppress one digit, because of “ties-to-even”
is 1.100 * 2^(6-7) pre-formatted
=====> 0110 100 (Value = 3/4 Rounded down)
Third example:
110 1111 (Value = 31/2)
represents 1.1111 * 2^(6-3)
is really 1.1111 * 2^3
rounds to 10.000 * 2^3 when we suppress one digit, because “ties-to-even” means “up” here and the carry propagates a long way
renormalizes into 1.000 * 2^4
is 1.000 * 2^(11-7) pre-formatted
=====> 1011 000 (Value = 16 Rounded up)
Examples 2 and 3 are “halfway cases”. Well, rounding from 4-bit fractions to 3-bit fractions, 50% of examples will be halfway cases anyway.
In example 2, 1.1001 is as close to 1.100 as it is to 1.101. So how is the result chosen? The one that is chosen is the one that ends with 0. Here, 1.100.
