Convert between formats with different numbers of bits for exponent and fractional part - C

I am trying to refresh my knowledge of floats. I am reading an exercise that asks to convert from format A (k = 3 exponent bits, 4 fraction bits, bias = 3) to format B (k = 4 exponent bits, 3 fraction bits, bias = 7).
We should round when necessary.
Example between formats:
011 0000 (Value = 1) =====> 0111 000 (Value = 1)
010 1001 (Value = 25/32) =====> 0110 100 (Value = 3/4 Rounded down)
110 1111 (Value = 31/2) =====> 1011 000 (Value = 16 Rounded up)
Problem: I cannot quite figure out how the conversion works. I managed to do it correctly in some cases, but my approach was to convert the bit pattern of format A to its decimal value and then express that value in the bit pattern of format B.
But is there a way to somehow go from one bit pattern to the other without doing this conversion, just knowing that we extend the e by 1 bit and reduce the fraction by 1?

But is there a way to somehow go from one bit pattern to the other without doing this conversion, just knowing that we extend the e by 1 bit and reduce the fraction by 1?
Yes, and this is much simpler than going through the decimal value (which is only correct if you convert to the exact decimal value and not an approximation).
011 0000 (Value = 1)
represents 1.0000 * 2^(3-3)
is really 1.0 * 2^0 in “natural” binary
represents 1.000 * 2^(7-7) to pre-format for the destination format
=====> 0111 000 (Value = 1)
Second example:
010 1001 (Value = 25/32)
represents 1.1001 * 2^(2-3)
is really 1.1001 * 2^(-1)
rounds to 1.100 * 2^(-1) when we suppress one digit, because of “ties-to-even”
is 1.100 * 2^(6-7) pre-formatted
=====> 0110 100 (Value = 3/4 Rounded down)
Third example:
110 1111 (Value = 31/2)
represents 1.1111 * 2^(6-3)
is really 1.1111 * 2^3
rounds to 10.000 * 2^3 when we suppress one digit, because “ties-to-even” means “up” here and the carry propagates a long way
renormalizes into 1.000 * 2^4
is 1.000 * 2^(11-7) pre-formatted
=====> 1011 000 (Value = 16 Rounded up)
Examples 2 and 3 are “halfway cases”. Then again, when rounding from 4-bit fractions to 3-bit fractions, half of all bit patterns are halfway cases anyway.
In example 2, 1.1001 is as close to 1.100 as it is to 1.101. So how is the result chosen? The one chosen is the one that ends in 0; here, 1.100.
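To make the recipe concrete, here is a minimal C sketch of this bit-level conversion (the function name a_to_b is just for illustration). It handles normalized values only; zero, denormals and the all-ones exponent patterns are deliberately ignored.
#include <stdio.h>

/* Sketch: convert a 7-bit pattern eee ffff in format A
 * (3 exponent bits, bias 3, 4 fraction bits) to format B
 * (4 exponent bits, bias 7, 3 fraction bits).
 * Normalized values only; zero, denormals, Inf/NaN are not handled. */
static unsigned a_to_b(unsigned a)
{
    unsigned expA  = (a >> 4) & 0x7;   /* 3 exponent bits */
    unsigned fracA = a & 0xF;          /* 4 fraction bits */

    /* Same unbiased exponent, new bias: eB = (eA - 3) + 7 = eA + 4 */
    unsigned expB  = expA + 4;

    /* Drop the low fraction bit, rounding to nearest, ties to even:
     * a dropped 1 is always an exact halfway case here, so round up
     * only when the remaining LSB is odd. */
    unsigned fracB = fracA >> 1;
    if ((fracA & 1) && (fracB & 1))
        fracB++;

    if (fracB == 0x8) {                /* 10.000: carry out of the fraction */
        fracB = 0;                     /* renormalize ...                   */
        expB++;                        /* ... and bump the exponent         */
    }
    return (expB << 3) | fracB;        /* 7-bit pattern eeee fff */
}

int main(void)
{
    unsigned tests[3] = { 0x30, 0x29, 0x6F };   /* 011 0000, 010 1001, 110 1111 */
    for (int i = 0; i < 3; i++)
        printf("%02X -> %02X\n", tests[i], a_to_b(tests[i]));
    return 0;
}
Running it on the three example patterns reproduces 0111 000, 0110 100 and 1011 000 (0x38, 0x34, 0x58).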

float %.2f round but double %.2lf not rounding and how to bypass it? [duplicate]

I have a problem with the accuracy of float in C/C++. When I execute the program below:
#include <stdio.h>
int main (void) {
    float a = 101.1;
    double b = 101.1;
    printf ("a: %f\n", a);
    printf ("b: %lf\n", b);
    return 0;
}
Result:
a: 101.099998
b: 101.100000
I believe a float is 32 bits, which should be enough to store 101.1. Why does this happen?
You can only represent numbers exactly in IEEE754 (at least for the single and double precision binary formats) if they can be constructed from adding together inverted powers of two (i.e., 2^-n, like 1, 1/2, 1/4, 1/65536 and so on), subject to the number of bits available for precision.
There is no combination of inverted powers of two that will get you exactly to 101.1, within the scaling provided by floats (23 bits of precision) or doubles (52 bits of precision).
If you want a quick tutorial on how this inverted-power-of-two stuff works, see this answer.
Applying the knowledge from that answer to your 101.1 number (as a single precision float):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm 1/n
0 10000101 10010100011001100110011
| | | || || || |+- 8388608
| | | || || || +-- 4194304
| | | || || |+----- 524288
| | | || || +------ 262144
| | | || |+--------- 32768
| | | || +---------- 16384
| | | |+------------- 2048
| | | +-------------- 1024
| | +------------------ 64
| +-------------------- 16
+----------------------- 2
The mantissa part of that actually continues forever for 101.1:
mmmmmmmmm mmmm mmmm mmmm mm
100101000 1100 1100 1100 11|00 1100 (and so on).
hence it's not a matter of precision, no amount of finite bits will represent that number exactly in IEEE754 format.
Using the bits to calculate the actual number (the closest approximation): the sign is positive. The exponent bits give 128 + 4 + 1 = 133; subtracting the 127 bias leaves 6, so the multiplier is 2^6, or 64.
The mantissa consists of 1 (the implicit leading bit) plus, for each set bit, 1/2^n where n is that bit's position counting from 1 at the left: {1/2, 1/16, 1/64, 1/1024, 1/2048, 1/16384, 1/32768, 1/262144, 1/524288, 1/4194304, 1/8388608}.
When you add all these up, you get 1.57968747615814208984375.
When you multiply that by the multiplier previously calculated, 64, you get 101.09999847412109375.
All numbers were calculated with bc using a scale of 100 decimal digits, resulting in a lot of trailing zeros, so the numbers should be very accurate. Doubly so, since I checked the result with:
#include <stdio.h>
int main (void) {
    float f = 101.1f;
    printf ("%.50f\n", f);
    return 0;
}
which also gave me 101.09999847412109375000....
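If you prefer to check the field values above programmatically rather than by hand, here is a small sketch (assuming float is the 32-bit IEEE-754 binary32 format) that copies the raw bits into an integer and splits out the sign, exponent and fraction fields:
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Sketch: dump the raw binary32 bit pattern of 101.1f so the sign,
 * exponent and fraction fields worked out above can be checked. */
int main (void) {
    float f = 101.1f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* reinterpret without aliasing issues */

    unsigned sign = bits >> 31;
    unsigned exp  = (bits >> 23) & 0xFF;
    unsigned frac = bits & 0x7FFFFF;

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           sign, exp, (int)exp - 127, frac);
    return 0;
}
On an IEEE-754 machine this prints exponent 133 (unbiased 6), matching the bias calculation above.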
You need to read more about how floating-point numbers work, especially the part on representable numbers.
You're not giving much of an explanation as to why you think that "32 bits should be enough for 101.1", so it's kind of hard to refute.
Binary floating-point numbers don't work well for all decimal numbers, since they basically store the number in, wait for it, base 2. As in binary.
This is a well-known fact, and it's the reason why e.g. money should never be handled in floating-point.
Your number 101.1 in base 10 is 1100101.0(0011) in base 2. The 0011 part repeats. Thus, no matter how many digits you have, the number cannot be represented exactly in the computer.
Looking at the IEEE-754 standard for floating point, you can find out why the double version only seemed to show it exactly.
PS: Derivation that 101.1 in base 10 is 1100101.0(0011) in base 2 (there is a small code sketch after these steps):
101 = 64 + 32 + 4 + 1
101 -> 1100101
.1 * 2 = .2 -> 0
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2....
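Here is a minimal C sketch of that multiply-by-2 procedure (the helper name frac_to_binary is hypothetical). Note that the double constant 0.1 is itself only an approximation of one tenth, so the printed pattern matches the ideal repeating 0011 only for the first few dozen digits:
#include <stdio.h>

/* Sketch: print the first n binary digits of a decimal fraction by
 * repeatedly multiplying by 2 and taking the integer part, exactly
 * as in the hand derivation above. */
static void frac_to_binary(double frac, int n)
{
    printf("0.");
    for (int i = 0; i < n; i++) {
        frac *= 2.0;
        if (frac >= 1.0) {
            putchar('1');
            frac -= 1.0;
        } else {
            putchar('0');
        }
    }
    putchar('\n');
}

int main(void)
{
    frac_to_binary(0.1, 24);    /* the .1 of 101.1 -> 0.000110011001100110011001 */
    return 0;
}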
PPS: It's the same as if you wanted to store the exact result of 1/3 in base 10.
What you see here is the combination of two factors:
IEEE754 floating point representation is not capable of accurately representing a whole class of rational and all irrational numbers
The effects of rounding (by default here to 6 decimal places) in printf. That is to say, the error when using a double occurs somewhere to the right of the 6th decimal place.
If you print the double with more digits, you'll see that even a double cannot represent the value exactly:
printf ("b: %.16f\n", b);
b: 101.0999999999999943
The thing is that float and double use a binary format, and not all floating-point numbers can be represented exactly in binary.
Unfortunately, most decimal floating point numbers cannot be accurately represented in (machine) floating point. This is just how things work.
For instance, the number 101.1 in binary will be represented as 1100101.0(0011) (the 0011 part repeats forever), so no matter how many bytes you have to store it, it will never become exact. Here is a little article about binary representation of floating point, and here you can find some examples of converting floating point numbers to binary.
If you want to learn more on this subject, I could recommend you this article, though it's long and not too easy to read.

Floating point conversion - Binary -> decimal

Here's the number I'm working on
1 01110 001 = ____
1 sign bit, 5 exp bits, 3 fraction bits
bias = 15
Here's my current process, hopefully you can tell me where I'm missing something
Convert binary exponent to decimal
01110 = 14
Subtract bias
14 - 15 = -1
Multiply fraction bits by result
0.001 * 2^-1 = 0.0001
Convert to decimal
.0001 = 1/16
The sign bit is 1 so my result is -1/16, however the given answer is -9/16. Would anyone mind explaining where the extra 8 in the fraction is coming from?
You seem to have the correct concept, including an understanding of the excess-N representation, but you're missing a crucial point.
The 3 bits used to encode the fractional part of the magnitude are 001, but there is an implicit 1. preceding the fraction bits, so the full magnitude is actually 1.001, which can be written as the improper fraction 1 + 1/8 = 9/8.
2^(-1) is the same as 1/(2^1), or 1/2.
9/8 * 1/2 = 9/16. Take the sign bit into account, and you arrive at the answer -9/16.
For a normalized floating point representation, the mantissa (significand) = 1 + f. This is sometimes called an implied leading 1 representation. It is a trick for getting an additional bit of precision for free, since we can always adjust the exponent E so that the significand M is in the range 1 <= M < 2 ...
You are almost correct but must take into consideration the implied 1. If it is denormalized (meaning the exponent bits are all 0s) you do not add an implied 1.
I would solve this problem as such...
1 01110 001
bias = 2^(k-1) - 1 = 2^4 - 1 = 15
Exponent = e - bias
14 - 15 = -1
Take the fractional bits ->> 001
Add the implied 1 ->> 1.001
Shift it by the exponent, which is -1. Becomes .1001
Count up the values, 1(1/2) + 0(1/4) + 0(1/8) + 1(1/16) = 9/16
With the negative sign bit, it becomes -9/16
hope that helps!
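For completeness, here is a small C sketch (the helper name decode9 is hypothetical) that decodes this 1 sign / 5 exponent / 3 fraction layout for normalized values, following exactly the steps above:
#include <stdio.h>
#include <math.h>

/* Sketch: decode a 9-bit pattern laid out as
 * 1 sign bit | 5 exponent bits (bias 15) | 3 fraction bits.
 * Normalized values only; all-zero and all-one exponents are not handled. */
static double decode9(unsigned bits)
{
    unsigned sign = (bits >> 8) & 0x1;
    unsigned exp  = (bits >> 3) & 0x1F;
    unsigned frac = bits & 0x7;

    double m = 1.0 + frac / 8.0;           /* implied leading 1: 1.fff */
    double v = ldexp(m, (int)exp - 15);    /* m * 2^(exp - bias)       */
    return sign ? -v : v;
}

int main(void)
{
    printf("%f\n", decode9(0x171));        /* 1 01110 001 -> -0.5625 = -9/16 */
    return 0;
}
(Link with -lm if your toolchain requires it for ldexp.)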

Number of bits assigned for double data type

How many bits out of 64 are assigned to the integer part and the fractional part in a double? Or is there any rule that specifies it?
Note: I know I already replied with a comment. This is for my own benefit as much as the OP's; I always learn something new when I try to explain it.
Floating-point values (regardless of precision) are represented as follows:
sign * significand * β^exp
where sign is 1 or -1, β is the base, exp is an integer exponent, and significand is a fraction. In this case, β is 2. For example, the real value 3.0 can be represented as 1.10₂ * 2^1, or 0.11₂ * 2^2, or even 0.011₂ * 2^3.
Remember that a binary number is a sum of powers of 2, with powers decreasing from the left. For example, 101₂ is equivalent to 1 * 2^2 + 0 * 2^1 + 1 * 2^0, which gives us the value 5. You can extend that past the radix point by using negative powers of 2, so 101.11₂ is equivalent to
1 * 2^2 + 0 * 2^1 + 1 * 2^0 + 1 * 2^-1 + 1 * 2^-2
which gives us the decimal value 5.75. A floating-point number is normalized such that there's a single non-zero digit prior to the radix point, so instead of writing 5.75 as 101.11₂, we'd write it as 1.0111₂ * 2^2
How is this encoded in a 32-bit or 64-bit binary format? The exact format depends on the platform; most modern platforms use the IEEE-754 specification (which also specifies the algorithms for floating-point arithmetic, as well as special values such as infinity and Not A Number (NaN)), but some older platforms may use their own proprietary format (such as VAX G and H extended-precision floats). I think x86 also has a proprietary 80-bit format for intermediate calculations.
The general layout looks something like the following:
seeeeeeee...ffffffff....
where s represents the sign bit, e represents bits devoted to the exponent, and f represents bits devoted to the significand or fraction. The IEEE-754 32-bit single-precision layout is
seeeeeeeefffffffffffffffffffffff
This gives us an 8-bit exponent (which can represent the values -126 through 127) and a 23-bit significand (giving us roughly 6 to 7 significant decimal digits). A 0 in the sign bit represents a positive value, 1 represents negative. The exponent is encoded such that 00000001₂ represents -126, 01111111₂ represents 0, and 11111110₂ represents 127 (00000000₂ is reserved for representing 0 and "denormalized" numbers, while 11111111₂ is reserved for representing infinity and NaN). This format also assumes a hidden leading fraction bit that's always set to 1. Thus, our value 5.75, which we represent as 1.0111₂ * 2^2, would be encoded in a 32-bit single-precision float as
0 10000001 01110000000000000000000
| |        |
| |        +-- significand (0111..., with the hidden leading bit: 1.0111)
| +----------- exponent (10000001 = 129; 129 - 127 = 2)
+------------- sign (0, positive)
The IEEE-754 double-precision float uses 11 bits for the exponent (-1022 through 1023) and 52 bits for the significand. I'm not going to bother writing that out (this post is turning into a novel as it is).
Floating-point numbers have a greater range than integers because of the exponent; the exponent 127 only takes 8 bits to encode, but 2^127 represents a 39-digit decimal number. The more bits in the exponent, the greater the range of values that can be represented. The precision (the number of significant digits) is determined by the number of bits in the significand. The more bits in the significand, the more significant digits you can represent.
Most real values cannot be represented exactly as a floating-point number; you cannot squeeze an infinite number of values into a finite number of bits. Thus, there are gaps between representable floating point values, and most values will be approximations. To illustrate the problem, let's look at an 8-bit "quarter-precision" format:
seeeefff
This gives us an exponent between -7 and 8 (we're not going to worry about special values like infinity and NaN) and a 3-bit significand with a hidden leading bit. The larger our exponent gets, the wider the gap between representable values gets. Here's a table showing the issue. The left column is the significand; each additional column shows the values we can represent for the given exponent:
sig    -1       0      1      2      3      4      5
---    ------   -----  -----  -----  -----  -----  ----
000    0.5      1      2      4      8      16     32
001    0.5625   1.125  2.25   4.5    9      18     36
010    0.625    1.25   2.5    5      10     20     40
011    0.6875   1.375  2.75   5.5    11     22     44
100    0.75     1.5    3      6      12     24     48
101    0.8125   1.625  3.25   6.5    13     26     52
110    0.875    1.75   3.5    7      14     28     56
111    0.9375   1.875  3.75   7.5    15     30     60
Note that as we move towards larger values, the gap between representable values gets larger. We can represent 8 values between 0.5 and 1.0, with a gap of 0.0625 between each. We can represent 8 values between 1.0 and 2.0, with a gap of 0.125 between each. We can represent 8 values between 2.0 and 4.0, with a gap of 0.25 in between each. And so on. Note that we can represent all the positive integers up to 16, but we cannot represent the value 17 in this format; we simply don't have enough bits in the significand to do so. If we add the values 8 and 9 in this format, we'll get 16 as a result, which is a rounding error. If that result is used in any other computation, that rounding error will be compounded.
Note that some values cannot be represented exactly no matter how many bits you have in the significand. Just as 1/3 gives us the non-terminating decimal fraction 0.333333..., 1/10 gives us the non-terminating binary fraction 0.000110011001100... (normalized, that's a significand of 1.100110011... times 2^-4). We would need an infinite number of bits in the significand to represent that value exactly.
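The widening gaps in the toy format above show up in real single precision too. As an illustration (not part of the original answer), this sketch uses nextafterf from <math.h> to print the gap to the next representable float at a few magnitudes:
#include <stdio.h>
#include <math.h>

/* Sketch: the gap (ULP) between adjacent representable floats grows
 * with magnitude, just like in the 8-bit toy format above. */
int main(void)
{
    float samples[4] = { 1.0f, 1024.0f, 16777216.0f, 1.0e20f };
    for (int i = 0; i < 4; i++) {
        float x = samples[i];
        printf("gap above %g is %g\n", x, nextafterf(x, INFINITY) - x);
    }
    return 0;
}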
A double on a 64-bit machine has one sign bit, 11 exponent bits and 52 fraction bits.
Think of it as (1 sign bit) * (52-bit fraction) * 2^(11-bit exponent)
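A quick way to see those three fields is to copy the double's bits into a 64-bit integer and mask them out; a minimal sketch, assuming the platform's double is IEEE-754 binary64:
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Sketch: split a 64-bit double into its sign (1 bit),
 * exponent (11 bits, bias 1023) and fraction (52 bits) fields. */
int main(void)
{
    double d = 3.0;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);

    unsigned sign               = (unsigned)(bits >> 63);
    unsigned biased_exp         = (unsigned)((bits >> 52) & 0x7FF);
    unsigned long long fraction = bits & 0xFFFFFFFFFFFFFULL;

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%013llx\n",
           sign, biased_exp, (int)biased_exp - 1023, fraction);
    return 0;
}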

C - Fast conversion between binary and hex representations

Reading or writing C code, I often have difficulty translating numbers between the binary and hex representations. Masks like 0xAAAA5555 are used very often in low-level programming, but it's difficult to recognize the special pattern of bits they represent. Is there any easy-to-remember rule for doing it quickly in your head?
Each hex digit maps to exactly 4 bits. I keep the 8-4-2-1 weights of those bits in mind, so it is very easy to do the conversion even in your head, i.e.
A = 10 = 8+2 = 1010 ...
5 = 4+1 = 0101
just keep the 8-4-2-1 weights in mind.
   A          5
8 4 2 1    8 4 2 1
1 0 1 0    0 1 0 1
I always find it easy to map HEX to BINARY numbers. Since each hex digit can be directly mapped to a four-digit binary number, you can think of:
> 0xA4
As
> b 1010 0100
>   ---- ----   (4 binary digits for each part)
>    A    4
The conversion is calculated by dividing the base 10 representation by 2 and stringing the remainders in reverse order. I do this in my head, seems to work.
So, say you want to know what 0xAAAA5555 looks like.
I just work out what A looks like and 5 looks like by doing
A = 10
10 / 2 = 5 r 0
5 / 2 = 2 r 1
2 / 2 = 1 r 0
1 / 2 = 0 r 1
so I know the A's look like 1010 (Note that 4 fingers are a good way to remember the remainders!)
You can string blocks of 4 bits together, so A A is 1010 1010. To convert binary back to hex, I always go through base 10 again by summing up the powers of 2. You can do this by forming blocks of 4 bits (padding with 0s) and stringing the results together.
so 111011101 is 0001 1101 1101 which is (1) (1 + 4 + 8) (1 + 4 + 8) = 1 13 13 which is 1DD
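If you'd rather let the machine do the digit-by-digit expansion, here is a minimal C sketch of the nibble rule (the helper name hex_to_binary is just for illustration):
#include <stdio.h>

/* Sketch: expand a hex string into binary one nibble at a time,
 * using the fact that each hex digit is exactly four bits (8-4-2-1). */
static const char *nibble[16] = {
    "0000", "0001", "0010", "0011", "0100", "0101", "0110", "0111",
    "1000", "1001", "1010", "1011", "1100", "1101", "1110", "1111"
};

static void hex_to_binary(const char *hex)
{
    for (; *hex; hex++) {
        int d;
        if (*hex >= '0' && *hex <= '9')      d = *hex - '0';
        else if (*hex >= 'A' && *hex <= 'F') d = *hex - 'A' + 10;
        else if (*hex >= 'a' && *hex <= 'f') d = *hex - 'a' + 10;
        else continue;                       /* ignore anything else */
        printf("%s ", nibble[d]);
    }
    putchar('\n');
}

int main(void)
{
    hex_to_binary("AAAA5555");   /* 1010 1010 1010 1010 0101 0101 0101 0101 */
    return 0;
}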

Transform BASE64 string to BASE16(HEX) string?

Hey, I'm trying to write a program to convert from a BASE64 string to a BASE16(HEX) string.
Here's an example:
BASE64: Ba7+Kj3N
HEXADECIMAL: 05 ae fe 2a 3d cd
BINARY: 00000101 10101110 11111110 00101010 00111101 11001101
DECIMAL: 5 174 254 42 61 205
What's the logic to convert from BASE64 to HEXADECIMAL?
Why is the decimal representation split up?
How come the binary representation is split into 6 sections?
Just want the math, the code I can handle just this process is confusing me. Thanks :)
Here's a function listing that will convert between any two bases: https://sites.google.com/site/computersciencesourcecode/conversion-algorithms/base-to-base
Edit (Hopefully to make this completely clear...)
You can find more information on this at the Wikipedia entry for Base 64.
The customary character set used for base 64, which is different than the character set you'll find in the link I provided prior to the edit, is:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
The character 'A' is the value 0, 'B' is the value 1, 'C' is the value 2, ...'8' is the value 60, '9' is the value 61, '+' is the value 62, and '/' is the value 63. This character set is very different from what we're used to using for binary, octal, base 10, and hexadecimal, where the first character is '0', which represents the value 0, etc.
Soju noted in the comments to this answer that each base 64 digit requires 6 bits to represent it in binary. Using the base 64 number provided in the original question and converting from base 64 to binary we get:
  B      a      7      +      K      j      3      N
000001 011010 111011 111110 001010 100011 110111 001101
Now we can push all the bits together (the spaces are only there to help humans read the number):
000001011010111011111110001010100011110111001101
Next, we can introduce new white-space delimiters every four bits starting with the Least Significant Bit:
0000 0101 1010 1110 1111 1110 0010 1010 0011 1101 1100 1101
It should now be very easy to see how this number is converted to base 16:
0000 0101 1010 1110 1111 1110 0010 1010 0011 1101 1100 1101
  0    5    A    E    F    E    2    A    3    D    C    D
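Here is a minimal C sketch of that "contiguous binary" route (not from the original answer): decode each base-64 character to 6 bits, pack the bits, and emit each complete byte as two hex digits. Padding ('=') and whitespace are not handled.
#include <stdio.h>
#include <string.h>

/* Sketch: base-64 text -> hex by accumulating 6 bits per character
 * and printing every full byte.  Minimal: no '=' padding handling. */
static const char b64set[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int main(void)
{
    const char *in = "Ba7+Kj3N";
    unsigned buf = 0;                       /* pending bits, right-aligned */
    int nbits = 0;

    for (const char *p = in; *p; p++) {
        const char *pos = strchr(b64set, *p);
        if (!pos)
            continue;                       /* skip characters not in the set */
        buf = (buf << 6) | (unsigned)(pos - b64set);
        nbits += 6;
        if (nbits >= 8) {                   /* a full byte is available */
            nbits -= 8;
            printf("%02x ", (buf >> nbits) & 0xFF);
            buf &= (1u << nbits) - 1;       /* drop the bits just printed */
        }
    }
    putchar('\n');                          /* prints: 05 ae fe 2a 3d cd */
    return 0;
}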
think of base-64 as base(2^6)
so in order to get alignment with a hex nibble you need at least 2 base 64 digits...
with 2 base-64 digits you have a base-(2^12) number, which can be represented by three base-(2^4) digits...
(00)(01)(02)(03)(04)(05)---(06)(07)(08)(09)(10)(11) (base-64) maps directly to:
(00)(01)(02)(03)---(04)(05)(06)(07)---(08)(09)(10)(11) (base 16)
so you can either convert to contiguous binary... or use 4 operations... the operations could deal with binary values, or they could use a set of look up tables (which could work on the char encoded digits):
first base-64 digit to first hex digit
first base-64 digit to first half of second hex digit
second base-64 digit to second half of second hex digit
second base-64 digit to third hex digit.
the advantage of this is that you can work on encoded bases without binary conversion.
it is pretty easy to do in a stream of chars... I am not aware of this actually being implemented anywhere though.
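As a sketch of that digit-pair idea (again, not from the original answer): every 2 base-64 digits carry 12 bits, which is exactly 3 hex digits, so the mapping can be done pair by pair without building the full binary string. The example assumes an even-length input containing only characters from the base-64 set:
#include <stdio.h>
#include <string.h>

/* Sketch: map each pair of base-64 digits (12 bits) directly to
 * three hex digits.  Assumes even-length, valid base-64 input. */
static const char b64set[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int main(void)
{
    const char *in = "Ba7+Kj3N";
    for (size_t i = 0; in[i] != '\0' && in[i + 1] != '\0'; i += 2) {
        unsigned hi = (unsigned)(strchr(b64set, in[i])     - b64set);
        unsigned lo = (unsigned)(strchr(b64set, in[i + 1]) - b64set);
        unsigned v  = (hi << 6) | lo;       /* 12 bits per digit pair */
        printf("%03x ", v);                 /* three hex digits       */
    }
    putchar('\n');                          /* prints: 05a efe 2a3 dcd */
    return 0;
}
Note the grouping differs from the byte-wise dump above (05a efe 2a3 dcd versus 05 ae fe 2a 3d cd); the digits are the same, just split at 12-bit instead of 8-bit boundaries.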
