How do you print out an IEEE754 number (without printf)? - c

For the purposes of this question, I do not have the ability to use printf facilities (I can't tell you why, unfortunately, but let's just assume for now that I know what I'm doing).
For an IEEE754 single precision number, you have the following bits:
SEEE EEEE EFFF FFFF FFFF FFFF FFFF FFFF
where S is the sign, E is the exponent and F is the fraction.
Printing the sign is relatively easy for all cases, as is catching all the special cases like NaN (E == 0xff, F != 0), Inf (E == 0xff, F == 0) and 0 (E == 0, F == 0, considered special just because the exponent bias isn't used in that case).
I have two questions.
The first is how best to turn denormalised numbers (where E == 0, F != 0) into normalised numbers (where 1 <= E <= 0xfe)? I suspect this will be necessary to simplify the answer to the next question (but I could be wrong so feel free to educate me).
The second question is how to print out the normalised numbers. I want to be able to print them out in two ways, exponential like -3.74195E3 and non-exponential like 3741.95. Although, just looking at those two side-by-side, it should be fairly easy to turn the former into the latter by just moving the decimal point around. So let's just concentrate on the exponential form.
I have a vague recollection of an algorithm I used long ago for printing out PI where you used one of the ever-reducing formulae and kept an upper and lower limit on the possibilities, outputting a digit when both limits agreed, and shifting the calculation by a factor of 10 (so when the upper and lower limits were 3.2364 and 3.1234, you could output the 3 and adjust for that in the calculation).
But it's been a long time since I did that so I don't even know if that's a suitable approach to take here. It seems so since the value of each bit is half that of the previous bit when moving through the fractional part (1/2, 1/4, 1/8 and so on).
I would really prefer not to have to go trudging through printf source code unless absolutely necessary so, if anyone can help out with this, I'll be eternally grateful.

If you want to get exact results for every conversion, you'll have to use arbitrary-precision arithmetic, as done in printf() implementations. If you want to get results that are "close," perhaps differing only in their least significant digit(s), then a very simple double-precision based algorithm will suffice: for the integer part, repeatedly divide by ten and append the remainders to form the decimal string (in reverse); for the fractional part, repeatedly multiply by ten and subtract off the integer parts to form the decimal string.
I recently wrote an article about this method: http://www.exploringbinary.com/quick-and-dirty-floating-point-to-decimal-conversion/ . It does not print scientific notation, but that should be trivial to add. The algorithm prints subnormal numbers (the ones I printed came out accurately, but you'd have to do more thorough testing).

Denormalized numbers cannot be turned into normalized numbers of the same floating point type. The equivalent normalized number's exponent will be too small to be represented by the exponent.
To print normalized numbers, one silly way I can think of is to repeatedly multiply by 10 (well, for the fractional part).

The first thing you need to do is convert the exponent to decimal (since presumably that's what you want the output in) using logarithms. You take the fraction of that result and multiply the mantissa by the exp10 of that fraction, and then convert that to decimal characters. From there you just need to insert the decimal point in the appropriate location, shifted by the now-decimal exponent.

There is a paper by G. Steele describing in more details an algorithm which seems based on the same principle as the one you outline. If memory serve, there are time when you are forced to use unbounded precision arithmetic. (I think it is How to print floating-point numbers accurately but citeseer is currently down from here, I can't confirm and google results are polluted by a retrospective paper by the same from 20 years later).

Related

Problem of a simple float number serialization example

I am reading the Serialization section of a tutorial http://beej.us/guide/bgnet/html/#serialization .
And I am reviewing the code which Encode the number into a portable binary form.
#include <stdint.h>
uint32_t htonf(float f)
{
uint32_t p;
uint32_t sign;
if (f < 0) { sign = 1; f = -f; }
else { sign = 0; }
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
return p;
}
float ntohf(uint32_t p)
{
float f = ((p>>16)&0x7fff); // whole part
f += (p&0xffff) / 65536.0f; // fraction
if (((p>>31)&0x1) == 0x1) { f = -f; } // sign bit set
return f;
}
I ran into problems with this line p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign .
According to the original code comments, this line extracts the whole part and sign, and the next line deals with fraction part.
Then I found an image about how float is represented in memory and started the calculation by hand.
From Wikipedia Single-precision floating-point format:
So I then presumed that whole part == exponent part.
But this (uint32_t)f)&0x7fff)<<16) is getting the last 15bits of the fraction part, if based on the image above.
Now I get confused, where did I get wrong?
It's important to realize what this code is not. This code does not do anything with the individual bits of a float value. (If it did, it wouldn't be portable and machine-independent, as it claims to be.) And the "portable" string representation it creates is fixed point, not floating point.
For example, if we use this code to convert the number -123.125, we will get the binary result
10000000011110110010000000000000
or in hexadecimal
807b2000
Now, where did that number 10000000011110110010000000000000 come from? Let's break it up into its sign, whole number, and fractional parts:
1 000000001111011 0010000000000000
The sign bit is 1 because our original number was negative. 000000001111011 is the 15-bit binary representation of 123. And 0010000000000000 is 8192. Where did 8192 come from? Well, 8192 ÷ 65536 is 0.125, which was our fractional part. (More on this below.)
How did the code do this? Let's walk through it step by step.
(1) Extract sign. That's easy: it's the ordinary test if(f < 0).
(2) Extract whole-number part. That's also easy: We take our floating-point number f, and cast it to type unint32_t. When you convert a floating-point number to an integer in C, the behavior is pretty obvious: it throws away the fractional part and gives you the integer. So if f is 123.125, (uint32_t)f is 123.
(3) Extract fraction. Since we've already got the integer part, we can isolate the fraction by starting with the original floating-point number f, and subtracting the integer part. That is, 123.125 - 123 = 0.125. Then we multiply the fractional part by 65536, which is 216.
It may not be obvious why we multiplied by 65536 and not some other number. In one sense, it doesn't matter what number you use. The goal here is to take a fractional number f and turn it into two integers a and b such that we can recover the fractional number f again later (although perhaps approximately). The way we're going to recover the fractional number f again later is by computing
a + b / x
where x is, well, some other number. If we chose 1000 for x, we'd break 123.125 up into a and b values of 123 and 125. We're choosing 65536, or 216, for x because that lets us make maximal use of the 16 bits we've allocated for the fractional part in our representation. Since x is 65536, b has to be some number we can divide by 65536 in order to get 0.125. So since b / 65536 = 0.125, by simple algebra we have b = 0.125 * 65536. Make sense?
Anyway, let's now look at the actual code for performing steps 1, 2, and 3.
if (f < 0) { sign = 1; f = -f; }
Easy peasy. If f is negative, our sign bit will be 1, and we want the rest of the code to operate on the positive version of f.
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31);
As mentioned, the important part here is (uint32_t)f, which just grabs the integer (whole-number) part of f. The bitmask & 0x7fff extracts the low-order 15 bits of it, throwing anything else away. (This is since our "portable representation" only allocates 15 bits for the whole-number part, meaning that numbers greater than 215-1 or 32767 can't be represented). The shift << 16 moves it into the high half of the eventual unint32_t result, where it belongs. And then | (sign<<31) takes the sign bit and puts it in the high-order position where it belongs.
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
Here, (int)f recomputes the integer (whole-number) part of f, and then f - (int)f extracts the fraction. We multiply it by 65536, as explained above. There may still be a fractional part (even after the multiplication, that is), so we cast to (uint32_t) again to throw that away, retaining only the integer part. We can only handle 16 bits of fraction, so we extract those bits (discarding anything else) with & 0xffff, although this should be unnecessary since we started with a positive fractional number less than 1, and multiplied it by 65536, so we should end up with a positive number less than 65536, i.e. we shouldn't have a number that won't exactly fit in 16 bits. Finally, the p |= operation stuffs these 16 bits we've just computed into the low-order half of p, and we're done.
Addendum: It may still not be obvious where the number 65536 came from, and why that was used instead of 10000 or some other number. So let's review two key points: we're ultimately dealing with integers here. Also, in one sense, the number 65536 actually was pretty arbitrary.
At the end of the day, any bit pattern we're working with is "really" just an integer. It's not a character, or a floating-point number, or a pointer — it's just an integer. If it has N bits, it represents integers from 0 to 2N-1.
In the fixed-point representation we're using here, there are three subfields: a 1-bit sign, a 15-bit whole-number part, and a 16-bit fraction part.
The interpretation of the sign and whole-number parts is obvious. But the question is: how shall we represent a fraction using a 16-bit integer?
And the answer is, we pick a number, almost any number, to divide by. We can call this number a "scaling factor".
We really can pick almost any number. Suppose I chose the number 3467 as my scaling factor. Here is now I would then represent several different fractions as integers:
    ½ → 1734/3467 → 1734
    ⅓ → 1155/3467 → 1155
    0.125 → 433/3467 → 433
So my fractions ½, ⅓, and 0.125 are represented by the integers 1734, 1155, and 433. To recover my original fractions, I just divide by 3467:
    1734 → 1734 ÷ 3467 → 0.500144
    1155 → 1155 ÷ 3467 → 0.333141
    433 → 1734 ÷ 3467 → 0.124891
Obviously I wasn't able to recover my original fractions exactly, but I came pretty close.
The other thing to wonder about is, where does that number 3467 "live"? If you're just looking at the numbers 1734, 1155, and 433, how do you know you're supposed to divide them by 3467? And the answer is, you don't know, at least, not just by looking at them. 3567 would have to be part of the definition of my silly fractional number format; people would just have to know, because I said so, that they had to multiply by 3467 when constructing integers to represent fractions, and divide by 3467 when recovering the original fractions.
And the other thing to look at is what the implications are of choosing various different scaling factors. The first thing is that, since in the end we're going to be using a 16-bit integer for the fractional representation, we absolutely can't use a scaling factor any greater than 65536. If we did, sometimes we'd end up with an integer greater than 65535, and it wouldn't fit in 16 bits. For example, suppose we tried to use a scaling factor of 70000, and suppose we tried to represent the fraction 0.95. Now, 0.95 is equal to 66500/70000, so our integer would be 66500, but that doesn't fit in 16 bits.
On the other hand, it turns out that ideally we don't want to use a number less than 65536, either. The smaller a number we use, the more of our 16-bit fractional representation we'll waste. When I chose 3467 in my silly example a little earlier, that meant I would represent fractions from 0/3467 = 0.00000 and 1/3467 = 0.000288 up to 3466/3467 = 0.999711. But I'd never use any of the integers from 3467 through 65536. They'd be wasted, and by not using them, I'd unnecessarily limit the precision of the fractions I could represent.
The "best" (least wasteful) scaling factor to use is 65536, although there's one other consideration, namely, which fractions do you want to be able to represent exactly? When I used 3467 as my scaling factor, I couldn't represent any of my test numbers ½, ⅓, or 0.125 exactly. If we use 65536 as the scaling factor, it turns out that we can represent fractions involving small powers of two exactly — that is, halves, quarters, eights, sixteenths, etc. — but not any other fractions, and in particular not most of the decimal fractions like 0.1. If we wanted to be able to represent decimal fractions exactly, we would have to use a scaling factor that was a power of 10. The largest power of 10 that will fit in 16 bits is 10000, and that would indeed let us exactly represent decimal fractions as small as 0.00001, although we'd waste about 5/6 (or 85%) of our 16-bit fractional range.
So if we wanted to represent decimal fractions exactly, without wasting precision, the inescapable conclusion is that we should not have allocated 16 bits for our fraction field in the first place. Better choices would have been 10 bits (ideal scaling factor 1024, we'd use 1000, wasting only 2%) or 20 bits (ideal scaling factor 1048576, we'd use 1000000, wasting about 5%).
The relevant excerpts from the page are
The thing to do is to pack the data into a known format and send that over the wire for decoding. For example, to pack floats, here’s something quick and dirty with plenty of room for improvement
and
On the plus side, it’s small, simple, and fast. On the minus side, it’s not an efficient use of space and the range is severely restricted—try storing a number greater-than 32767 in there and it won’t be very happy! You can also see in the above example that the last couple decimal places are not correctly preserved.
The code is presented only as an example. It is really quick and dirty, because it packs and unpacks the float as a fixed point number with 16 bits for fractions, 15 bits for integer magnitude and one for sign. It is an example and does not attempt to map floats 1:1.
It is in fact rather incredibly stupid algorithm: It can map 1:1 all IEEE 754 float32s within magnitude range ~256...32767 without losing a bit of information, truncate the fractions in floats in range 0...255 to 16 bits, and fail spectacularly for any number >= 32768. And NaNs.
As for the endianness problem: for any protocol that does not work with integers >= 32 bits intrinsically, someone needs to decide how to again serialize these integers into the other format. For example in the Internet at lowest levels data consists of 8-bit octets.
There are 24 obvious ways mapping a 32-bit unsigned integer into 4 octets, of which 2 are now generally used, and some more historically. Of course there are a countably infinite (and exponentially sillier) ways of encoding them...

Algorithm for printing decimal value of a huge(over 128bits) binary number?

TLDR, at the bottom :)
Brief:
I am in a process of creating an basic arithmetic library(addition, subtraction, ...) for handling huge numbers. One of the problem i am facing is printing these huge binary numbers into decimal.
I have huge binary number stored in an array of uint64_t. e.g.
uint64_t a[64] = {0};
Now, the goal is to print the 64*64bits binary number in the console/file as its decimal value.
Initial Work:
To elaborate the problem I want to describe how I printed hex value.
int i;
int s = 1;
a[1] = (uint64_t)0xFF;
for(i = s; i>= 0; i--)
{
printf("0x%08llX, ", a[i]);
}
Output:
0x000000FF, 0x00000000,
Similarly for printing OCT value I can just take LSB 3 bits from a[64], print decimal equivalent of those bits, 3 bits right shift all the bits of a[64] and keep repeating until all the values of a[64] has been printed. (print in revers order to keep first Oct digit on the right)
I can print Hex and Oct value of a binary of unlimited size just by repeating this unit algorithm, but I could not find/develop one for Decimal which I can repeat over and over again to print a[64](or something bigger).
What I have thought of:
My initial idea was to keep subtracting
max_64 =(uint64)10000000000000000000; //(i.e.10^19)
the biggest multiple of 10 inside uint64_t, from a until the value inside a is smaller than max_64 (which is basically equivalent of rem_64 = a%max_64 ) and print the rem_64 value using
printf("%019llu",rem_64);
which is the 1st 19 decimal digits of the number a.
Then do an arithmetic operation similar to (not the code):
a = a/max_64; /* Integer division(no fractional part) to remove right most 19 dec digits from 'a' */
and keep repeating and printing 19 decimal digits. (print in such a way that first found 19 digits are on the right, then next 19 digits on its left and so on...).
The problem is this process is to long and I don't want to use all these to just print the dec value. And was looking for a process which avoids using these huge time consuming arithmetic operations.
What I believe is that there must be a way to print huge size just by repeating an algorithm (similar to how Hex and Oct can be printed) and I hope someone could point me to the right direction.
What my library can do(so far):
Add (Using Full-Adder)
Sub (Using Full-subtractor)
Compare (by comparing array size and comparing array elements)
Div (Integer division, no fractional part)
Modulus (%)
Multiplication (basically adding from several times :( )
I will write code for other operations if needed, but I would like to implement the printing function independent of the library if possible.
Consider the problem like this:
You have been given a binary number X of n bits (1<=n<=64*64) you have to print out X in decimal. You can use existing library if absolutely needed but better if unused.
TLDR:
Any code, reference or unit algorithm which I can repeat for printing decimal value of a binary of too big and/or unknown size would be much helpful. Emphasis on algorithm i.e. I don't need a code if some one could describe a process I will be able to implement it. Thanks in advance.
When faced with such doubts, and given that there are many bigint libraries out there, it is interesting to look into their code. I had a look at Java's BigInteger, which has a toString method, and they do two things:
for small numbers, they bite the bullet and do something similar to what you proposed - straightforward link-by-link base conversion, outputting decimal numbers in each step.
for large numbers, they use the recursive Schönhage algorithm, which they quote in the comments as being referred to in, among other places,
Knuth, Donald, The Art of Computer Programming, Vol. 2, Answers to
Exercises (4.4) Question 14.

Why can C represent some floating points but not others with same amount of decimals [duplicate]

There have been several questions posted to SO about floating-point representation. For example, the decimal number 0.1 doesn't have an exact binary representation, so it's dangerous to use the == operator to compare it to another floating-point number. I understand the principles behind floating-point representation.
What I don't understand is why, from a mathematical perspective, are the numbers to the right of the decimal point any more "special" that the ones to the left?
For example, the number 61.0 has an exact binary representation because the integral portion of any number is always exact. But the number 6.10 is not exact. All I did was move the decimal one place and suddenly I've gone from Exactopia to Inexactville. Mathematically, there should be no intrinsic difference between the two numbers -- they're just numbers.
By contrast, if I move the decimal one place in the other direction to produce the number 610, I'm still in Exactopia. I can keep going in that direction (6100, 610000000, 610000000000000) and they're still exact, exact, exact. But as soon as the decimal crosses some threshold, the numbers are no longer exact.
What's going on?
Edit: to clarify, I want to stay away from discussion about industry-standard representations, such as IEEE, and stick with what I believe is the mathematically "pure" way. In base 10, the positional values are:
... 1000 100 10 1 1/10 1/100 ...
In binary, they would be:
... 8 4 2 1 1/2 1/4 1/8 ...
There are also no arbitrary limits placed on these numbers. The positions increase indefinitely to the left and to the right.
Decimal numbers can be represented exactly, if you have enough space - just not by floating binary point numbers. If you use a floating decimal point type (e.g. System.Decimal in .NET) then plenty of values which can't be represented exactly in binary floating point can be exactly represented.
Let's look at it another way - in base 10 which you're likely to be comfortable with, you can't express 1/3 exactly. It's 0.3333333... (recurring). The reason you can't represent 0.1 as a binary floating point number is for exactly the same reason. You can represent 3, and 9, and 27 exactly - but not 1/3, 1/9 or 1/27.
The problem is that 3 is a prime number which isn't a factor of 10. That's not an issue when you want to multiply a number by 3: you can always multiply by an integer without running into problems. But when you divide by a number which is prime and isn't a factor of your base, you can run into trouble (and will do so if you try to divide 1 by that number).
Although 0.1 is usually used as the simplest example of an exact decimal number which can't be represented exactly in binary floating point, arguably 0.2 is a simpler example as it's 1/5 - and 5 is the prime that causes problems between decimal and binary.
Side note to deal with the problem of finite representations:
Some floating decimal point types have a fixed size like System.Decimal others like java.math.BigDecimal are "arbitrarily large" - but they'll hit a limit at some point, whether it's system memory or the theoretical maximum size of an array. This is an entirely separate point to the main one of this answer, however. Even if you had a genuinely arbitrarily large number of bits to play with, you still couldn't represent decimal 0.1 exactly in a floating binary point representation. Compare that with the other way round: given an arbitrary number of decimal digits, you can exactly represent any number which is exactly representable as a floating binary point.
For example, the number 61.0 has an exact binary representation because the integral portion of any number is always exact. But the number 6.10 is not exact. All I did was move the decimal one place and suddenly I've gone from Exactopia to Inexactville. Mathematically, there should be no intrinsic difference between the two numbers -- they're just numbers.
Let's step away for a moment from the particulars of bases 10 and 2. Let's ask - in base b, what numbers have terminating representations, and what numbers don't? A moment's thought tells us that a number x has a terminating b-representation if and only if there exists an integer n such that x b^n is an integer.
So, for example, x = 11/500 has a terminating 10-representation, because we can pick n = 3 and then x b^n = 22, an integer. However x = 1/3 does not, because whatever n we pick we will not be able to get rid of the 3.
This second example prompts us to think about factors, and we can see that for any rational x = p/q (assumed to be in lowest terms), we can answer the question by comparing the prime factorisations of b and q. If q has any prime factors not in the prime factorisation of b, we will never be able to find a suitable n to get rid of these factors.
Thus for base 10, any p/q where q has prime factors other than 2 or 5 will not have a terminating representation.
So now going back to bases 10 and 2, we see that any rational with a terminating 10-representation will be of the form p/q exactly when q has only 2s and 5s in its prime factorisation; and that same number will have a terminating 2-representatiion exactly when q has only 2s in its prime factorisation.
But one of these cases is a subset of the other! Whenever
q has only 2s in its prime factorisation
it obviously is also true that
q has only 2s and 5s in its prime factorisation
or, put another way, whenever p/q has a terminating 2-representation, p/q has a terminating 10-representation. The converse however does not hold - whenever q has a 5 in its prime factorisation, it will have a terminating 10-representation , but not a terminating 2-representation. This is the 0.1 example mentioned by other answers.
So there we have the answer to your question - because the prime factors of 2 are a subset of the prime factors of 10, all 2-terminating numbers are 10-terminating numbers, but not vice versa. It's not about 61 versus 6.1 - it's about 10 versus 2.
As a closing note, if by some quirk people used (say) base 17 but our computers used base 5, your intuition would never have been led astray by this - there would be no (non-zero, non-integer) numbers which terminated in both cases!
The root (mathematical) reason is that when you are dealing with integers, they are countably infinite.
Which means, even though there are an infinite amount of them, we could "count out" all of the items in the sequence, without skipping any. That means if we want to get the item in the 610000000000000th position in the list, we can figure it out via a formula.
However, real numbers are uncountably infinite. You can't say "give me the real number at position 610000000000000" and get back an answer. The reason is because, even between 0 and 1, there are an infinite number of values, when you are considering floating-point values. The same holds true for any two floating point numbers.
More info:
http://en.wikipedia.org/wiki/Countable_set
http://en.wikipedia.org/wiki/Uncountable_set
Update:
My apologies, I appear to have misinterpreted the question. My response is about why we cannot represent every real value, I hadn't realized that floating point was automatically classified as rational.
To repeat what I said in my comment to Mr. Skeet: we can represent 1/3, 1/9, 1/27, or any rational in decimal notation. We do it by adding an extra symbol. For example, a line over the digits that repeat in the decimal expansion of the number. What we need to represent decimal numbers as a sequence of binary numbers are 1) a sequence of binary numbers, 2) a radix point, and 3) some other symbol to indicate the repeating part of the sequence.
Hehner's quote notation is a way of doing this. He uses a quote symbol to represent the repeating part of the sequence. The article: http://www.cs.toronto.edu/~hehner/ratno.pdf and the Wikipedia entry: http://en.wikipedia.org/wiki/Quote_notation.
There's nothing that says we can't add a symbol to our representation system, so we can represent decimal rationals exactly using binary quote notation, and vice versa.
BCD - Binary-coded Decimal - representations are exact. They are not very space-efficient, but that's a trade-off you have to make for accuracy in this case.
This is a good question.
All your question is based on "how do we represent a number?"
ALL the numbers can be represented with decimal representation or with binary (2's complement) representation. All of them !!
BUT some (most of them) require infinite number of elements ("0" or "1" for the binary position, or "0", "1" to "9" for the decimal representation).
Like 1/3 in decimal representation (1/3 = 0.3333333... <- with an infinite number of "3")
Like 0.1 in binary ( 0.1 = 0.00011001100110011.... <- with an infinite number of "0011")
Everything is in that concept. Since your computer can only consider finite set of digits (decimal or binary), only some numbers can be exactly represented in your computer...
And as said Jon, 3 is a prime number which isn't a factor of 10, so 1/3 cannot be represented with a finite number of elements in base 10.
Even with arithmetic with arbitrary precision, the numbering position system in base 2 is not able to fully describe 6.1, although it can represent 61.
For 6.1, we must use another representation (like decimal representation, or IEEE 854 that allows base 2 or base 10 for the representation of floating-point values)
If you make a big enough number with floating point (as it can do exponents), then you'll end up with inexactness in front of the decimal point, too. So I don't think your question is entirely valid because the premise is wrong; it's not the case that shifting by 10 will always create more precision, because at some point the floating point number will have to use exponents to represent the largeness of the number and will lose some precision that way as well.
It's the same reason you cannot represent 1/3 exactly in base 10, you need to say 0.33333(3). In binary it is the same type of problem but just occurs for different set of numbers.
(Note: I'll append 'b' to indicate binary numbers here. All other numbers are given in decimal)
One way to think about things is in terms of something like scientific notation. We're used to seeing numbers expressed in scientific notation like, 6.022141 * 10^23. Floating point numbers are stored internally using a similar format - mantissa and exponent, but using powers of two instead of ten.
Your 61.0 could be rewritten as 1.90625 * 2^5, or 1.11101b * 2^101b with the mantissa and exponents. To multiply that by ten and (move the decimal point), we can do:
(1.90625 * 2^5) * (1.25 * 2^3) = (2.3828125 * 2^8) = (1.19140625 * 2^9)
or in with the mantissa and exponents in binary:
(1.11101b * 2^101b) * (1.01b * 2^11b) = (10.0110001b * 2^1000b) = (1.00110001b * 2^1001b)
Note what we did there to multiply the numbers. We multiplied the mantissas and added the exponents. Then, since the mantissa ended greater than two, we normalized the result by bumping the exponent. It's just like when we adjust the exponent after doing an operation on numbers in decimal scientific notation. In each case, the values that we worked with had a finite representation in binary, and so the values output by the basic multiplication and addition operations also produced values with a finite representation.
Now, consider how we'd divide 61 by 10. We'd start by dividing the mantissas, 1.90625 and 1.25. In decimal, this gives 1.525, a nice short number. But what is this if we convert it to binary? We'll do it the usual way -- subtracting out the largest power of two whenever possible, just like converting integer decimals to binary, but we'll use negative powers of two:
1.525 - 1*2^0 --> 1
0.525 - 1*2^-1 --> 1
0.025 - 0*2^-2 --> 0
0.025 - 0*2^-3 --> 0
0.025 - 0*2^-4 --> 0
0.025 - 0*2^-5 --> 0
0.025 - 1*2^-6 --> 1
0.009375 - 1*2^-7 --> 1
0.0015625 - 0*2^-8 --> 0
0.0015625 - 0*2^-9 --> 0
0.0015625 - 1*2^-10 --> 1
0.0005859375 - 1*2^-11 --> 1
0.00009765625...
Uh oh. Now we're in trouble. It turns out that 1.90625 / 1.25 = 1.525, is a repeating fraction when expressed in binary: 1.11101b / 1.01b = 1.10000110011...b Our machines only have so many bits to hold that mantissa and so they'll just round the fraction and assume zeroes beyond a certain point. The error you see when you divide 61 by 10 is the difference between:
1.100001100110011001100110011001100110011...b * 2^10b
and, say:
1.100001100110011001100110b * 2^10b
It's this rounding of the mantissa that leads to the loss of precision that we associate with floating point values. Even when the mantissa can be expressed exactly (e.g., when just adding two numbers), we can still get numeric loss if the mantissa needs too many digits to fit after normalizing the exponent.
We actually do this sort of thing all the time when we round decimal numbers to a manageable size and just give the first few digits of it. Because we express the result in decimal it feels natural. But if we rounded a decimal and then converted it to a different base, it'd look just as ugly as the decimals we get due to floating point rounding.
I'm surprised no one has stated this yet: use continued fractions. Any rational number can be represented finitely in binary this way.
Some examples:
1/3 (0.3333...)
0; 3
5/9 (0.5555...)
0; 1, 1, 4
10/43 (0.232558139534883720930...)
0; 4, 3, 3
9093/18478 (0.49209871198181621387596060179673...)
0; 2, 31, 7, 8, 5
From here, there are a variety of known ways to store a sequence of integers in memory.
In addition to storing your number with perfect accuracy, continued fractions also have some other benefits, such as best rational approximation. If you decide to terminate the sequence of numbers in a continued fraction early, the remaining digits (when recombined to a fraction) will give you the best possible fraction. This is how approximations to pi are found:
Pi's continued fraction:
3; 7, 15, 1, 292 ...
Terminating the sequence at 1, this gives the fraction:
355/113
which is an excellent rational approximation.
In the equation
2^x = y ;
x = log(y) / log(2)
Hence, I was just wondering if we could have a logarithmic base system for binary like,
2^1, 2^0, 2^(log(1/2) / log(2)), 2^(log(1/4) / log(2)), 2^(log(1/8) / log(2)),2^(log(1/16) / log(2)) ........
That might be able to solve the problem, so if you wanted to write something like 32.41 in binary, that would be
2^5 + 2^(log(0.4) / log(2)) + 2^(log(0.01) / log(2))
Or
2^5 + 2^(log(0.41) / log(2))
The problem is that you do not really know whether the number actually is exactly 61.0 . Consider this:
float a = 60;
float b = 0.1;
float c = a + b * 10;
What is the value of c? It is not exactly 61, because b is not really .1 because .1 does not have an exact binary representation.
The number 61.0 does indeed have an exact floating-point operation—but that's not true for all integers. If you wrote a loop that added one to both a double-precision floating point number and a 64-bit integer, eventually you'd reach a point where the 64-bit integer perfectly represents a number, but the floating point doesn't—because there aren't enough significant bits.
It's just much easier to reach the point of approximation on the right side of the decimal point. If you started writing out all the numbers in binary floating point, it'd make more sense.
Another way of thinking about it is that when you note that 61.0 is perfectly representable in base 10, and shifting the decimal point around doesn't change that, you're performing multiplication by powers of ten (10^1, 10^-1). In floating point, multiplying by powers of two does not affect the precision of the number. Try taking 61.0 and dividing it by three repeatedly for an illustration of how a perfectly precise number can lose its precise representation.
There's a threshold because the meaning of the digit has gone from integer to non-integer. To represent 61, you have 6*10^1 + 1*10^0; 10^1 and 10^0 are both integers. 6.1 is 6*10^0 + 1*10^-1, but 10^-1 is 1/10, which is definitely not an integer. That's how you end up in Inexactville.
A parallel can be made of fractions and whole numbers. Some fractions eg 1/7 cannot be represented in decimal form without lots and lots of decimals. Because floating point is binary based the special cases change but the same sort of accuracy problems present themselves.
There are an infinite number of rational numbers, and a finite number of bits with which to represent them. See http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems.
you know integer numbers right? each bit represent 2^n
2^4=16
2^3=8
2^2=4
2^1=2
2^0=1
well its the same for floating point(with some distinctions) but the bits represent 2^-n
2^-1=1/2=0.5
2^-2=1/(2*2)=0.25
2^-3=0.125
2^-4=0.0625
Floating point binary representation:
sign Exponent Fraction(i think invisible 1 is appended to the fraction )
B11 B10 B9 B8 B7 B6 B5 B4 B3 B2 B1 B0
The high scoring answer above nailed it.
First you were mixing base 2 and base 10 in your question, then when you put a number on the right side that is not divisible into the base you get problems. Like 1/3 in decimal because 3 doesnt go into a power of 10 or 1/5 in binary which doesnt go into a power of 2.
Another comment though NEVER use equal with floating point numbers, period. Even if it is an exact representation there are some numbers in some floating point systems that can be accurately represented in more than one way (IEEE is bad about this, it is a horrible floating point spec to start with, so expect headaches). No different here 1/3 is not EQUAL to the number on your calculator 0.3333333, no matter how many 3's there are to the right of the decimal point. It is or can be close enough but is not equal. so you would expect something like 2*1/3 to not equal 2/3 depending on the rounding. Never use equal with floating point.
As we have been discussing, in floating point arithmetic, the decimal 0.1 cannot be perfectly represented in binary.
Floating point and integer representations provide grids or lattices for the numbers represented. As arithmetic is done, the results fall off the grid and have to be put back onto the grid by rounding. Example is 1/10 on a binary grid.
If we use binary coded decimal representation as one gentleman suggested, would we be able to keep numbers on the grid?
For a simple answer: The computer doesn't have infinite memory to store fraction (after representing the decimal number as the form of scientific notation). According to IEEE 754 standard for double-precision floating-point numbers, we only have a limit of 53 bits to store fraction.
For more info: http://mathcenter.oxford.emory.edu/site/cs170/ieee754/
I will not bother to repeat what the other 20 answers have already summarized, so I will just answer briefly:
The answer in your content:
Why can't base two numbers represent certain ratios exactly?
For the same reason that decimals are insufficient to represent certain ratios, namely, irreducible fractions with denominators containing prime factors other than two or five which will always have an indefinite string in at least the mantissa of its decimal expansion.
Why can't decimal numbers be represented exactly in binary?
This question at face value is based on a misconception regarding values themselves. No number system is sufficient to represent any quantity or ratio in a manner that the thing itself tells you that it is both a quantity, and at the same time also gives the interpretation in and of itself about the intrinsic value of the representation. As such, all quantitative representations, and models in general, are symbolic and can only be understood a posteriori, namely, after one has been taught how to read and interpret these numbers.
Since models are subjective things that are true insofar as they reflect reality, we do not strictly need to interpret a binary string as sums of negative and positive powers of two. Instead, one may observe that we can create an arbitrary set of symbols that use base two or any other base to represent any number or ratio exactly. Just consider that we can refer to all of infinity using a single word and even a single symbol without "showing infinity" itself.
As an example, I am designing a binary encoding for mixed numbers so that I can have more precision and accuracy than an IEEE 754 float. At the time of writing this, the idea is to have a sign bit, a reciprocal bit, a certain number of bits for a scalar to determine how much to "magnify" the fractional portion, and then the remaining bits are divided evenly between the integer portion of a mixed number, and the latter a fixed-point number which, if the reciprocal bit is set, should be interpreted as one divided by that number. This has the benefit of allowing me to represent numbers with infinite decimal expansions by using their reciprocals which do have terminating decimal expansions, or alternatively, as a fraction directly, potentially as an approximation, depending on my needs.
You can't represent 0.1 exactly in binary for the same reason you can't measure 0.1 inch using a conventional English ruler.
English rulers, like binary fractions, are all about halves. You can measure half an inch, or a quarter of an inch (which is of course half of a half), or an eighth, or a sixteenth, etc.
If you want to measure a tenth of an inch, though, you're out of luck. It's less than an eighth of an inch, but more than a sixteenth. If you try to get more exact, you find that it's a little more than 3/32, but a little less than 7/64. I've never seen an actual ruler that had gradations finer than 64ths, but if you do the math, you'll find that 1/10 is less than 13/128, and it's more than 25/256, and it's more than 51/512. You can keep going finer and finer, to 1024ths and 2048ths and 4096ths and 8192nds, but you will never find an exact marking, even on an infinitely-fine base-2 ruler, that exactly corresponds to 1/10, or 0.1.
You will find something interesting, though. Let's look at all the approximations I've listed, and for each one, record explicitly whether 0.1 is less or greater:
fraction
decimal
0.1 is...
as 0/1
1/2
0.5
less
0
1/4
0.25
less
0
1/8
0.125
less
0
1/16
0.0625
greater
1
3/32
0.09375
greater
1
7/64
0.109375
less
0
13/128
0.1015625
less
0
25/256
0.09765625
greater
1
51/512
0.099609375
greater
1
103/1024
0.1005859375
less
0
205/2048
0.10009765625
less
0
409/4096
0.099853515625
greater
1
819/8192
0.0999755859375
greater
1
Now, if you read down the last column, you get 0001100110011. It's no coincidence that the infinitely-repeating binary fraction for 1/10 is 0.0001100110011...

Do the rules of arithmetic hold for addition in C?

I'm new to C, and I'm having such a hard time understanding this material. I really need help! Please someone help.
In arithmetic, the sum of any two positive integers is great than either:
(n+m) > n for n, m > 0
(n+m) > m for n, m > 0
C has an addition operator +. Does this arithmetic rule hold in C?
I know this is False. But can please someone explain to me why so, I can understand it? Please provide counter-example?
Thank you in advance.
(I won't solve this for you, but will provide some pointers.)
It is false for both integer and floating-point arithmetic, for different reasons.
Integers are susceptible to overflow.
Adding a very small floating-point number m to a very large number n returns n. Have a read of What Every Computer Scientist Should Know About Floating-Point Arithmetic.
It doesn't hold, since C's integers are not "abstract" infinititely-sized integers that the real integers (in mathematics) are.
In C, integers are discrete and digital, and implemented using a fixed number of bits. This leads to limited range, and problems when you go (try to) out of range. Typically integers will wrap, which is very "un-natural".
I brief search did not show up nice answers describing these, so I rather attempt to answer this nicely here, for beginners.
The answer is false, of course, but why so?
Integers
In C, or any programming language providing some kind of integer type, this type does not mean it in the mathematical sense. In mathematical sense non-negative integers range from 0 to infinity. A computer, however has only limited storage, so integers necessarily are constrained to something less than infinity.
This alone proves that a + b > a and a + b > b can not be true all the time, since it can be set up so both a and b is less than the largest number the computer can represent in it's storage, but a + b is larger than that.
What exactly happens here, depends. Some mentioned wraparound, but that's not necessarily the case. The C language the first place defines integer overflow to be an undefined behaviour, that whatever, including fire and smoke, may happen if the code happens to step on it (of course in the reality that won't happen, but interpreting the standard strictly it could, as well as the breach of the space-time continuum).
I won't describe how wraparound works here since it is beyond the scope of the problem itself.
Floating point
The case here is again just the same like for integers: the key to understand why mathematics don't fully apply here is that the computer has a limited storage.
Floating numbers in the computers memory are represented much like scientific notation: a mantissa, and an exponent. Both of these have a fixed limited range depending on the type of the floating point variable.
In base 10, you may conceive this like you have the exponent ranging from 10 ^ -10 to 10 ^ 10, and the mantissa having like 4 fraction digits after the decimal point, always normalized.
With this in mind, check these example additions:
1.2345 * (10 ^ 0) + 1.0237 * (10 ^ 5)
5.2345 * (10 ^ 10) + 6.7891 * (10 ^ 10)
The first is an example where the result will equal one of the input numbers while both were larger than zero. The second is an example where the result is out of range.
The floating point representation computers use however is capable to represent infinity, and two at that: positive infinity and negative infinity. So while the first example passes as a proof, the second does not, since that addition's result is positive infinity.
However with this in mind, you could produce an another proofing example:
3.1416 * (10 ^ 0) + (+ infinity)
Of course the result is positive infinity, no matter what you add it to. And of course positive infinity is not larger than positive infinity, so proved again.

Floating Point Number storage in c [duplicate]

There have been several questions posted to SO about floating-point representation. For example, the decimal number 0.1 doesn't have an exact binary representation, so it's dangerous to use the == operator to compare it to another floating-point number. I understand the principles behind floating-point representation.
What I don't understand is why, from a mathematical perspective, are the numbers to the right of the decimal point any more "special" that the ones to the left?
For example, the number 61.0 has an exact binary representation because the integral portion of any number is always exact. But the number 6.10 is not exact. All I did was move the decimal one place and suddenly I've gone from Exactopia to Inexactville. Mathematically, there should be no intrinsic difference between the two numbers -- they're just numbers.
By contrast, if I move the decimal one place in the other direction to produce the number 610, I'm still in Exactopia. I can keep going in that direction (6100, 610000000, 610000000000000) and they're still exact, exact, exact. But as soon as the decimal crosses some threshold, the numbers are no longer exact.
What's going on?
Edit: to clarify, I want to stay away from discussion about industry-standard representations, such as IEEE, and stick with what I believe is the mathematically "pure" way. In base 10, the positional values are:
... 1000 100 10 1 1/10 1/100 ...
In binary, they would be:
... 8 4 2 1 1/2 1/4 1/8 ...
There are also no arbitrary limits placed on these numbers. The positions increase indefinitely to the left and to the right.
Decimal numbers can be represented exactly, if you have enough space - just not by floating binary point numbers. If you use a floating decimal point type (e.g. System.Decimal in .NET) then plenty of values which can't be represented exactly in binary floating point can be exactly represented.
Let's look at it another way - in base 10 which you're likely to be comfortable with, you can't express 1/3 exactly. It's 0.3333333... (recurring). The reason you can't represent 0.1 as a binary floating point number is for exactly the same reason. You can represent 3, and 9, and 27 exactly - but not 1/3, 1/9 or 1/27.
The problem is that 3 is a prime number which isn't a factor of 10. That's not an issue when you want to multiply a number by 3: you can always multiply by an integer without running into problems. But when you divide by a number which is prime and isn't a factor of your base, you can run into trouble (and will do so if you try to divide 1 by that number).
Although 0.1 is usually used as the simplest example of an exact decimal number which can't be represented exactly in binary floating point, arguably 0.2 is a simpler example as it's 1/5 - and 5 is the prime that causes problems between decimal and binary.
Side note to deal with the problem of finite representations:
Some floating decimal point types have a fixed size like System.Decimal others like java.math.BigDecimal are "arbitrarily large" - but they'll hit a limit at some point, whether it's system memory or the theoretical maximum size of an array. This is an entirely separate point to the main one of this answer, however. Even if you had a genuinely arbitrarily large number of bits to play with, you still couldn't represent decimal 0.1 exactly in a floating binary point representation. Compare that with the other way round: given an arbitrary number of decimal digits, you can exactly represent any number which is exactly representable as a floating binary point.
For example, the number 61.0 has an exact binary representation because the integral portion of any number is always exact. But the number 6.10 is not exact. All I did was move the decimal one place and suddenly I've gone from Exactopia to Inexactville. Mathematically, there should be no intrinsic difference between the two numbers -- they're just numbers.
Let's step away for a moment from the particulars of bases 10 and 2. Let's ask - in base b, what numbers have terminating representations, and what numbers don't? A moment's thought tells us that a number x has a terminating b-representation if and only if there exists an integer n such that x b^n is an integer.
So, for example, x = 11/500 has a terminating 10-representation, because we can pick n = 3 and then x b^n = 22, an integer. However x = 1/3 does not, because whatever n we pick we will not be able to get rid of the 3.
This second example prompts us to think about factors, and we can see that for any rational x = p/q (assumed to be in lowest terms), we can answer the question by comparing the prime factorisations of b and q. If q has any prime factors not in the prime factorisation of b, we will never be able to find a suitable n to get rid of these factors.
Thus for base 10, any p/q where q has prime factors other than 2 or 5 will not have a terminating representation.
So now going back to bases 10 and 2, we see that any rational with a terminating 10-representation will be of the form p/q exactly when q has only 2s and 5s in its prime factorisation; and that same number will have a terminating 2-representatiion exactly when q has only 2s in its prime factorisation.
But one of these cases is a subset of the other! Whenever
q has only 2s in its prime factorisation
it obviously is also true that
q has only 2s and 5s in its prime factorisation
or, put another way, whenever p/q has a terminating 2-representation, p/q has a terminating 10-representation. The converse however does not hold - whenever q has a 5 in its prime factorisation, it will have a terminating 10-representation , but not a terminating 2-representation. This is the 0.1 example mentioned by other answers.
So there we have the answer to your question - because the prime factors of 2 are a subset of the prime factors of 10, all 2-terminating numbers are 10-terminating numbers, but not vice versa. It's not about 61 versus 6.1 - it's about 10 versus 2.
As a closing note, if by some quirk people used (say) base 17 but our computers used base 5, your intuition would never have been led astray by this - there would be no (non-zero, non-integer) numbers which terminated in both cases!
The root (mathematical) reason is that when you are dealing with integers, they are countably infinite.
Which means, even though there are an infinite amount of them, we could "count out" all of the items in the sequence, without skipping any. That means if we want to get the item in the 610000000000000th position in the list, we can figure it out via a formula.
However, real numbers are uncountably infinite. You can't say "give me the real number at position 610000000000000" and get back an answer. The reason is because, even between 0 and 1, there are an infinite number of values, when you are considering floating-point values. The same holds true for any two floating point numbers.
More info:
http://en.wikipedia.org/wiki/Countable_set
http://en.wikipedia.org/wiki/Uncountable_set
Update:
My apologies, I appear to have misinterpreted the question. My response is about why we cannot represent every real value, I hadn't realized that floating point was automatically classified as rational.
To repeat what I said in my comment to Mr. Skeet: we can represent 1/3, 1/9, 1/27, or any rational in decimal notation. We do it by adding an extra symbol. For example, a line over the digits that repeat in the decimal expansion of the number. What we need to represent decimal numbers as a sequence of binary numbers are 1) a sequence of binary numbers, 2) a radix point, and 3) some other symbol to indicate the repeating part of the sequence.
Hehner's quote notation is a way of doing this. He uses a quote symbol to represent the repeating part of the sequence. The article: http://www.cs.toronto.edu/~hehner/ratno.pdf and the Wikipedia entry: http://en.wikipedia.org/wiki/Quote_notation.
There's nothing that says we can't add a symbol to our representation system, so we can represent decimal rationals exactly using binary quote notation, and vice versa.
BCD - Binary-coded Decimal - representations are exact. They are not very space-efficient, but that's a trade-off you have to make for accuracy in this case.
This is a good question.
All your question is based on "how do we represent a number?"
ALL the numbers can be represented with decimal representation or with binary (2's complement) representation. All of them !!
BUT some (most of them) require infinite number of elements ("0" or "1" for the binary position, or "0", "1" to "9" for the decimal representation).
Like 1/3 in decimal representation (1/3 = 0.3333333... <- with an infinite number of "3")
Like 0.1 in binary ( 0.1 = 0.00011001100110011.... <- with an infinite number of "0011")
Everything is in that concept. Since your computer can only consider finite set of digits (decimal or binary), only some numbers can be exactly represented in your computer...
And as said Jon, 3 is a prime number which isn't a factor of 10, so 1/3 cannot be represented with a finite number of elements in base 10.
Even with arithmetic with arbitrary precision, the numbering position system in base 2 is not able to fully describe 6.1, although it can represent 61.
For 6.1, we must use another representation (like decimal representation, or IEEE 854 that allows base 2 or base 10 for the representation of floating-point values)
If you make a big enough number with floating point (as it can do exponents), then you'll end up with inexactness in front of the decimal point, too. So I don't think your question is entirely valid because the premise is wrong; it's not the case that shifting by 10 will always create more precision, because at some point the floating point number will have to use exponents to represent the largeness of the number and will lose some precision that way as well.
It's the same reason you cannot represent 1/3 exactly in base 10, you need to say 0.33333(3). In binary it is the same type of problem but just occurs for different set of numbers.
(Note: I'll append 'b' to indicate binary numbers here. All other numbers are given in decimal)
One way to think about things is in terms of something like scientific notation. We're used to seeing numbers expressed in scientific notation like, 6.022141 * 10^23. Floating point numbers are stored internally using a similar format - mantissa and exponent, but using powers of two instead of ten.
Your 61.0 could be rewritten as 1.90625 * 2^5, or 1.11101b * 2^101b with the mantissa and exponents. To multiply that by ten and (move the decimal point), we can do:
(1.90625 * 2^5) * (1.25 * 2^3) = (2.3828125 * 2^8) = (1.19140625 * 2^9)
or in with the mantissa and exponents in binary:
(1.11101b * 2^101b) * (1.01b * 2^11b) = (10.0110001b * 2^1000b) = (1.00110001b * 2^1001b)
Note what we did there to multiply the numbers. We multiplied the mantissas and added the exponents. Then, since the mantissa ended greater than two, we normalized the result by bumping the exponent. It's just like when we adjust the exponent after doing an operation on numbers in decimal scientific notation. In each case, the values that we worked with had a finite representation in binary, and so the values output by the basic multiplication and addition operations also produced values with a finite representation.
Now, consider how we'd divide 61 by 10. We'd start by dividing the mantissas, 1.90625 and 1.25. In decimal, this gives 1.525, a nice short number. But what is this if we convert it to binary? We'll do it the usual way -- subtracting out the largest power of two whenever possible, just like converting integer decimals to binary, but we'll use negative powers of two:
1.525 - 1*2^0 --> 1
0.525 - 1*2^-1 --> 1
0.025 - 0*2^-2 --> 0
0.025 - 0*2^-3 --> 0
0.025 - 0*2^-4 --> 0
0.025 - 0*2^-5 --> 0
0.025 - 1*2^-6 --> 1
0.009375 - 1*2^-7 --> 1
0.0015625 - 0*2^-8 --> 0
0.0015625 - 0*2^-9 --> 0
0.0015625 - 1*2^-10 --> 1
0.0005859375 - 1*2^-11 --> 1
0.00009765625...
Uh oh. Now we're in trouble. It turns out that 1.90625 / 1.25 = 1.525, is a repeating fraction when expressed in binary: 1.11101b / 1.01b = 1.10000110011...b Our machines only have so many bits to hold that mantissa and so they'll just round the fraction and assume zeroes beyond a certain point. The error you see when you divide 61 by 10 is the difference between:
1.100001100110011001100110011001100110011...b * 2^10b
and, say:
1.100001100110011001100110b * 2^10b
It's this rounding of the mantissa that leads to the loss of precision that we associate with floating point values. Even when the mantissa can be expressed exactly (e.g., when just adding two numbers), we can still get numeric loss if the mantissa needs too many digits to fit after normalizing the exponent.
We actually do this sort of thing all the time when we round decimal numbers to a manageable size and just give the first few digits of it. Because we express the result in decimal it feels natural. But if we rounded a decimal and then converted it to a different base, it'd look just as ugly as the decimals we get due to floating point rounding.
I'm surprised no one has stated this yet: use continued fractions. Any rational number can be represented finitely in binary this way.
Some examples:
1/3 (0.3333...)
0; 3
5/9 (0.5555...)
0; 1, 1, 4
10/43 (0.232558139534883720930...)
0; 4, 3, 3
9093/18478 (0.49209871198181621387596060179673...)
0; 2, 31, 7, 8, 5
From here, there are a variety of known ways to store a sequence of integers in memory.
In addition to storing your number with perfect accuracy, continued fractions also have some other benefits, such as best rational approximation. If you decide to terminate the sequence of numbers in a continued fraction early, the remaining digits (when recombined to a fraction) will give you the best possible fraction. This is how approximations to pi are found:
Pi's continued fraction:
3; 7, 15, 1, 292 ...
Terminating the sequence at 1, this gives the fraction:
355/113
which is an excellent rational approximation.
In the equation
2^x = y ;
x = log(y) / log(2)
Hence, I was just wondering if we could have a logarithmic base system for binary like,
2^1, 2^0, 2^(log(1/2) / log(2)), 2^(log(1/4) / log(2)), 2^(log(1/8) / log(2)),2^(log(1/16) / log(2)) ........
That might be able to solve the problem, so if you wanted to write something like 32.41 in binary, that would be
2^5 + 2^(log(0.4) / log(2)) + 2^(log(0.01) / log(2))
Or
2^5 + 2^(log(0.41) / log(2))
The problem is that you do not really know whether the number actually is exactly 61.0 . Consider this:
float a = 60;
float b = 0.1;
float c = a + b * 10;
What is the value of c? It is not exactly 61, because b is not really .1 because .1 does not have an exact binary representation.
The number 61.0 does indeed have an exact floating-point operation—but that's not true for all integers. If you wrote a loop that added one to both a double-precision floating point number and a 64-bit integer, eventually you'd reach a point where the 64-bit integer perfectly represents a number, but the floating point doesn't—because there aren't enough significant bits.
It's just much easier to reach the point of approximation on the right side of the decimal point. If you started writing out all the numbers in binary floating point, it'd make more sense.
Another way of thinking about it is that when you note that 61.0 is perfectly representable in base 10, and shifting the decimal point around doesn't change that, you're performing multiplication by powers of ten (10^1, 10^-1). In floating point, multiplying by powers of two does not affect the precision of the number. Try taking 61.0 and dividing it by three repeatedly for an illustration of how a perfectly precise number can lose its precise representation.
There's a threshold because the meaning of the digit has gone from integer to non-integer. To represent 61, you have 6*10^1 + 1*10^0; 10^1 and 10^0 are both integers. 6.1 is 6*10^0 + 1*10^-1, but 10^-1 is 1/10, which is definitely not an integer. That's how you end up in Inexactville.
A parallel can be made of fractions and whole numbers. Some fractions eg 1/7 cannot be represented in decimal form without lots and lots of decimals. Because floating point is binary based the special cases change but the same sort of accuracy problems present themselves.
There are an infinite number of rational numbers, and a finite number of bits with which to represent them. See http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems.
you know integer numbers right? each bit represent 2^n
2^4=16
2^3=8
2^2=4
2^1=2
2^0=1
well its the same for floating point(with some distinctions) but the bits represent 2^-n
2^-1=1/2=0.5
2^-2=1/(2*2)=0.25
2^-3=0.125
2^-4=0.0625
Floating point binary representation:
sign Exponent Fraction(i think invisible 1 is appended to the fraction )
B11 B10 B9 B8 B7 B6 B5 B4 B3 B2 B1 B0
The high scoring answer above nailed it.
First you were mixing base 2 and base 10 in your question, then when you put a number on the right side that is not divisible into the base you get problems. Like 1/3 in decimal because 3 doesnt go into a power of 10 or 1/5 in binary which doesnt go into a power of 2.
Another comment though NEVER use equal with floating point numbers, period. Even if it is an exact representation there are some numbers in some floating point systems that can be accurately represented in more than one way (IEEE is bad about this, it is a horrible floating point spec to start with, so expect headaches). No different here 1/3 is not EQUAL to the number on your calculator 0.3333333, no matter how many 3's there are to the right of the decimal point. It is or can be close enough but is not equal. so you would expect something like 2*1/3 to not equal 2/3 depending on the rounding. Never use equal with floating point.
As we have been discussing, in floating point arithmetic, the decimal 0.1 cannot be perfectly represented in binary.
Floating point and integer representations provide grids or lattices for the numbers represented. As arithmetic is done, the results fall off the grid and have to be put back onto the grid by rounding. Example is 1/10 on a binary grid.
If we use binary coded decimal representation as one gentleman suggested, would we be able to keep numbers on the grid?
For a simple answer: The computer doesn't have infinite memory to store fraction (after representing the decimal number as the form of scientific notation). According to IEEE 754 standard for double-precision floating-point numbers, we only have a limit of 53 bits to store fraction.
For more info: http://mathcenter.oxford.emory.edu/site/cs170/ieee754/
I will not bother to repeat what the other 20 answers have already summarized, so I will just answer briefly:
The answer in your content:
Why can't base two numbers represent certain ratios exactly?
For the same reason that decimals are insufficient to represent certain ratios, namely, irreducible fractions with denominators containing prime factors other than two or five which will always have an indefinite string in at least the mantissa of its decimal expansion.
Why can't decimal numbers be represented exactly in binary?
This question at face value is based on a misconception regarding values themselves. No number system is sufficient to represent any quantity or ratio in a manner that the thing itself tells you that it is both a quantity, and at the same time also gives the interpretation in and of itself about the intrinsic value of the representation. As such, all quantitative representations, and models in general, are symbolic and can only be understood a posteriori, namely, after one has been taught how to read and interpret these numbers.
Since models are subjective things that are true insofar as they reflect reality, we do not strictly need to interpret a binary string as sums of negative and positive powers of two. Instead, one may observe that we can create an arbitrary set of symbols that use base two or any other base to represent any number or ratio exactly. Just consider that we can refer to all of infinity using a single word and even a single symbol without "showing infinity" itself.
As an example, I am designing a binary encoding for mixed numbers so that I can have more precision and accuracy than an IEEE 754 float. At the time of writing this, the idea is to have a sign bit, a reciprocal bit, a certain number of bits for a scalar to determine how much to "magnify" the fractional portion, and then the remaining bits are divided evenly between the integer portion of a mixed number, and the latter a fixed-point number which, if the reciprocal bit is set, should be interpreted as one divided by that number. This has the benefit of allowing me to represent numbers with infinite decimal expansions by using their reciprocals which do have terminating decimal expansions, or alternatively, as a fraction directly, potentially as an approximation, depending on my needs.
You can't represent 0.1 exactly in binary for the same reason you can't measure 0.1 inch using a conventional English ruler.
English rulers, like binary fractions, are all about halves. You can measure half an inch, or a quarter of an inch (which is of course half of a half), or an eighth, or a sixteenth, etc.
If you want to measure a tenth of an inch, though, you're out of luck. It's less than an eighth of an inch, but more than a sixteenth. If you try to get more exact, you find that it's a little more than 3/32, but a little less than 7/64. I've never seen an actual ruler that had gradations finer than 64ths, but if you do the math, you'll find that 1/10 is less than 13/128, and it's more than 25/256, and it's more than 51/512. You can keep going finer and finer, to 1024ths and 2048ths and 4096ths and 8192nds, but you will never find an exact marking, even on an infinitely-fine base-2 ruler, that exactly corresponds to 1/10, or 0.1.
You will find something interesting, though. Let's look at all the approximations I've listed, and for each one, record explicitly whether 0.1 is less or greater:
fraction
decimal
0.1 is...
as 0/1
1/2
0.5
less
0
1/4
0.25
less
0
1/8
0.125
less
0
1/16
0.0625
greater
1
3/32
0.09375
greater
1
7/64
0.109375
less
0
13/128
0.1015625
less
0
25/256
0.09765625
greater
1
51/512
0.099609375
greater
1
103/1024
0.1005859375
less
0
205/2048
0.10009765625
less
0
409/4096
0.099853515625
greater
1
819/8192
0.0999755859375
greater
1
Now, if you read down the last column, you get 0001100110011. It's no coincidence that the infinitely-repeating binary fraction for 1/10 is 0.0001100110011...

Resources