8-bit versus 32-bit bias with exponent - C

I am trying to work my way through binary numbers and normalized values. I am confused because we were taught that numbers are represented as 8-bit values. We would do examples with 8 bits where 1 bit was the sign, the next 4 bits were the exponent, and the last 3 were for the number. It was going OK until we jumped into 32-bit numbers, where 1 bit is the sign, the next 8 are the exponent, and the final 23 are the remaining number.
My question is: why the different representations? Sometimes numbers are 8 bits, sometimes 32 bits? Why not make them 3 bits and sometimes 13 bits? Or 40 bits and 64 bits? There appears to be no rhyme or reason. Are we dealing with 8 bits when we talk about numbers, or 32? Here is an example.
https://www.youtube.com/watch?v=vi5RXPBO-8E
Any explanation would help. Right now I don't know if I should study the material based on 8 bits, or on 32 bits with the first bit the sign, the next 8 the exponent, and the last 23 the actual number. Very confused.

I assume you were taught how floating point numbers are represented using 8 bits because it's much easier to do the math with smaller numbers; however, you can only represent so many numbers with 8 bits (256 different bit patterns, to be exact).
As you said, you learned how floating point numbers work with 8 bits:
S EEEE NNN -- where S is the sign bit, the E's are the exponent bits, and the N's are the number/significand bits.
The sign of the number is simply (-1) raised to the sign bit.
The exponent is the exponent bits interpreted either as a signed integer or, as in IEEE 754, as an unsigned integer minus the representation's bias.
The significand is 1 + SUM(i = 1 to p) N[i] * 2^(-i), where p is the precision, i.e. the number of significand bits.
The value can then be computed as:
(-1)^S * 2^(exponent) * significand
As a more concrete example (exponent bias of 2^(4-1)-1 = 7)
0 1000 101
s = 0
exponent = 8 - 7 = 1
significand = 1 + 0.5 * 1 + 0.25 * 0 + 0.125 * 1 = 1.625
value = (-1)^0 * 1.625 * 2^1 = 3.25
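If it helps to see the same arithmetic spelled out in code, here is a small C sketch (my own illustration of the 1-4-3 layout used above, not a standard format or library routine) that decodes such an 8-bit value:

#include <stdio.h>
#include <math.h>

/* Decode an 8-bit value laid out as S EEEE NNN (1 sign bit, 4 exponent bits,
 * 3 significand bits) with a bias of 2^(4-1) - 1 = 7. Only normal numbers
 * are handled; the all-zero and all-one exponents (zero/denormals and
 * infinity/NaN in IEEE-style formats) are ignored for simplicity. */
double decode_minifloat(unsigned char bits)
{
    int sign       = (bits >> 7) & 0x1;           /* S    */
    int exp_field  = (bits >> 3) & 0xF;           /* EEEE */
    int frac_field =  bits       & 0x7;           /* NNN  */

    int exponent = exp_field - 7;                 /* remove the bias */
    double significand = 1.0 + frac_field / 8.0;  /* 1 + N * 2^-3    */

    return (sign ? -1.0 : 1.0) * ldexp(significand, exponent);
}

int main(void)
{
    /* 0 1000 101 from the worked example above */
    printf("%g\n", decode_minifloat(0x45));       /* prints 3.25 */
    return 0;
}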
The same representation scheme can be applied to any number of bits; in that regard, the choice of 8 or 32 is fairly arbitrary.
32 and 64 bits are most often chosen to represent floating point numbers because they are powers of 2, are easily stored in memory (a whole number of bytes), and computer ALUs are designed to work with 32/64-bit numbers.
In C a float is typically 32 bits and a double 64 bits (most implementations use the IEEE 754 binary32 and binary64 formats, though the standard does not require it).
You can read more on IEEE-754 floating point representation. Wikipedia has a good explanation of how it works here.
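If you want to verify the sizes on your own machine, a quick check (a sketch only; the exact sizes are implementation-defined) is:

#include <stdio.h>

int main(void)
{
    /* On most platforms float is IEEE 754 binary32 and double is binary64,
     * but the C standard does not guarantee it, so check locally. */
    printf("float:  %zu bytes\n", sizeof(float));   /* typically 4 */
    printf("double: %zu bytes\n", sizeof(double));  /* typically 8 */
    return 0;
}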

Related

max floating point value [duplicate]

I am wondering if the max float represented in IEEE 754 is:
(1.11111111111111111111111)_b*2^[(11111111)_b-127]
Here _b means binary representation. But that value is 3.403201383*10^38, which is different from 3.402823669*10^38, which is (1.0)_b*2^[(11111111)_b-127] and given by, for example, C++ <limits>. Isn't
(1.11111111111111111111111)_b*2^[(11111111)_b-127] representable and larger in the framework?
Does anybody know why?
Thank you.
The exponent 11111111b is reserved for infinities and NaNs, so your number cannot be represented.
The greatest value that can be represented in single precision, approximately 3.4028235×10^38, is actually 1.11111111111111111111111b × 2^(11111110b - 127).
See also http://en.wikipedia.org/wiki/Single-precision_floating-point_format
Being the "m" the mantisa and the "e" the exponent, the answer is:
In your case, if the number of bits on IEEE 754 are:
16 Bits you have 1 for the sign, 5 for the exponent and 10 for the mantissa. The largest number represented is 4,293,918,720.
32 Bits you have 1 for the sign, 8 for the exponent and 23 for the mantissa. The largest number represented is 3.402823466E38
64 Bits you have 1 for the sign, 11 for the exponent and 52 for the mantissa. The largest number represented is 2^1024 - 2^971
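You can also ask the implementation directly. A short sketch (assuming <float.h> describes IEEE 754 types, which is typical but not mandated by the C standard):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Largest finite values on an IEEE 754 implementation:
     * FLT_MAX = (2 - 2^-23) * 2^127  ~= 3.402823466e38
     * DBL_MAX = (2 - 2^-52) * 2^1023 = 2^1024 - 2^971 ~= 1.7976931348623157e308 */
    printf("FLT_MAX = %.9e\n",  FLT_MAX);
    printf("DBL_MAX = %.17e\n", DBL_MAX);
    return 0;
}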

floating point numbers in C slightly different from expected

I noticed that in C, a float can be as small as 2^-149, and as large as 2^127. If I try to set the float to anything smaller or larger than these, I get zero and inf, respectively. The 2^-149 doesn't make sense to me; where does it come from?
It appears that the exponent is 8 bits, so we can have 2^-128 to 2^127. The overall sign of the float is 1 bit, so that leaves 23 bits for the significand since a float is 32 bits total. If all 23 bits of the significand are placed after the binary "decimal point" such that the significand is <= 0.5, then we should be able to have floats as small as 2^(-128-23) = 2^-151. On the other hand, if one of the 23 bits is placed BEFORE the binary "decimal" point such that the significand is <= 1, then we would have the smallest float be 2^(-128-22) = 2^-150. Both of these do not agree with the fact that the smallest float seems to be 2^-149. Why is this?
Infinity (+ or -) is represented by the maximum exponent (all 1 bits) and a zero mantissa. NaN is represented by the maximum exponent and any non-zero mantissa.
Denormal numbers, and zero, are represented with the minimum exponent (all 0 bits).
So those two exponents are not available for normal numbers. The smallest normal float uses exponent field 1, i.e. 2^-126. Denormals reuse that minimum effective exponent but with a leading 0 instead of the implicit leading 1, so the smallest positive denormal is 2^-23 × 2^-126 = 2^-149, which is the limit you observed.
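A quick sketch to see these limits (assuming an IEEE 754 float and the C99 <math.h>/<float.h> functions and constants):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    /* FLT_MIN is the smallest *normal* float: 2^-126.
     * The smallest positive denormal is 2^-126 * 2^-23 = 2^-149
     * (C11 also names it FLT_TRUE_MIN). */
    printf("FLT_MIN           = %g\n", FLT_MIN);
    printf("smallest denormal = %g\n", ldexpf(1.0f, -149));
    printf("one step smaller  = %g\n", ldexpf(1.0f, -150)); /* typically rounds to 0 */
    return 0;
}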

Why does the range of int have a minus 1?

I read that the range of an int depends on its size in bytes.
So taking int to be 4 bytes long, that's 4 * 8 bits = 32 bits.
So the range should be: 2^(32-1) = 2^31.
Why do some people say it's 2^31 - 1 though?
Thanks!
Because the counting starts from 0.
The maximum of int is 2,147,483,647, while 2^31 is 2,147,483,648; hence we subtract 1.
Also, one bit is lost to the positive/negative sign.
Check this interesting wiki article on integers:
The most common representation of a positive integer is a string of bits, using the binary numeral system. The order of the memory bytes storing the bits varies; see endianness. The width or precision of an integral type is the number of bits in its representation. An integral type with n bits can encode 2^n numbers; for example an unsigned type typically represents the non-negative values 0 through 2^n − 1. Other encodings of integer values to bit patterns are sometimes used, for example Binary-coded decimal or Gray code, or as printed character codes such as ASCII.
There are four well-known ways to represent signed numbers in a binary computing system. The most common is two's complement, which allows a signed integral type with n bits to represent numbers from −2^(n−1) through 2^(n−1) − 1. Two's complement arithmetic is convenient because there is a perfect one-to-one correspondence between representations and values (in particular, no separate +0 and −0), and because addition, subtraction and multiplication do not need to distinguish between signed and unsigned types. Other possibilities include offset binary, sign-magnitude, and ones' complement.
You mean 2^32 - 1, NOT 2^(32-1).
But your question is about why people use 2^31. A whole bit is lost if the int is signed: the first bit indicates whether the number is positive or negative.
A signed int (32 bit) ranges from -2,147,483,648 to +2,147,483,647.
An unsigned int (32 bit) ranges from 0 to 4,294,967,295 (which is 2^32 - 1).
int is a signed data type.
The first bit represents the sign, followed by bits for the value.
If the sign bit is 0, the value is simply the sum of 2^i for every bit i that is set to 1.
e.g. 0...00101 is 2^0 + 2^2 = 5
If the first bit is 1, the value is -2^32 plus the sum of 2^i for every bit i that is set to 1.
e.g. 1...111100 is -2^32 + 2^31 + 2^30 + ... + 2^2 = -4
All bits 0 gives zero.
If you work through it, you will see that any number between (and including) -2^31 and 2^0 + 2^1 + ... + 2^30 = 2^31 - 1 can be created with those 32 bits.
2^(32-1) is not the same as 2^32 - 1 (since 0 is included in the range, we subtract 1).
For your understanding, let us use the small number 4 instead of 32:
2^(4-1) = 8,
whereas 2^4 - 1 = 16 - 1 = 15.
Hope this helps!
Since an integer is 32 bits, it can store a total of 2^32 values. So an integer ranges from -2^31 to 2^31 - 1, giving a total of 2^32 values (2^31 values in the negative range plus 2^31 values in the positive range, including 0). The first bit (the most significant bit) is reserved for the sign of the integer. You also need to understand how negative integers are stored: they are stored in two's complement form, so -9 is stored as the two's complement of 9.
So 9 is stored in 32 bit system as
0000 0000 0000 0000 0000 0000 0000 1001
and -9 will be stored as
1111 1111 1111 1111 1111 1111 1111 0111 (2's complement of 9).
If, due to some arithmetic operation, an integer exceeds the maximum value (2^31 - 1), it wraps around to the negative values on typical two's-complement hardware (strictly speaking, signed overflow is undefined behaviour in C). So adding 1 to 2^31 - 1 gives you -2^31.
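Here is a small sketch that makes the stored bit patterns visible (it assumes a 32-bit two's-complement int, which is the common case but not required by the C standard):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int pos = 9;
    int neg = -9;

    /* Viewing the same 32 bits as unsigned shows the two's complement
     * pattern: -9 is stored as 2^32 - 9 = 4294967287 = 0xFFFFFFF7. */
    printf("%d  -> 0x%08X\n", pos, (unsigned)pos);   /* 0x00000009 */
    printf("%d -> 0x%08X\n",  neg, (unsigned)neg);   /* 0xFFFFFFF7 */

    printf("INT_MAX = %d, INT_MIN = %d\n", INT_MAX, INT_MIN);
    return 0;
}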

What are the max and min numbers a short type can store in C?

I'm having a hard time grasping data types in C. I'm going through a C book and one of the challenges asks what the maximum and minimum number a short can store.
Using sizeof(short) I can see that a short consumes 2 bytes. That means it's 16 bits, which means two numbers, since it takes 8 bits to store the binary representation of a number. For example, 9 would be 00001001, which fills up one byte. So would it not be 0 to 99 for unsigned, and -9 to 9 signed?
I know I'm wrong, but I'm not sure why. It says here the maximum is (-)32,767 for signed, and 65,535 for unsigned.
short int: 2 bytes (16 bits), range -32,768 to +32,767
Think in decimal for a second. If you have only 2 digits for a number, that means you can store from 00 to 99 in them. If you have 4 digits, that range becomes 0000 to 9999.
A binary number is similar to decimal, except the digits can be only 0 and 1, instead of 0, 1, 2, 3, ..., 9.
If you have a number like this:
01011101
This is:
0*128 + 1*64 + 0*32 + 1*16 + 1*8 + 1*4 + 0*2 + 1*1 = 93
So as you can see, you can store bigger values than 9 in one byte. In an unsigned 8-bit number, you can actually store values from 00000000 to 11111111, which is 255 in decimal.
In a 2-byte number, this range becomes from 00000000 00000000 to 11111111 11111111 which happens to be 65535.
Your statement "it takes 8 bits to store the binary representation of a number" is like saying "it takes 8 digits to store the decimal representation of a number", which is not correct. For example the number 12345678901234567890 has more than 8 digits. In the same way, you cannot fit all numbers in 8 bits, but only 256 of them. That's why you get 2-byte (short), 4-byte (int) and 8-byte (long long) numbers. In truth, if you need even higher range of numbers, you would need to use a library.
As far as negative numbers are concerned, in a two's-complement computer they are just a convention that uses the upper half of the range as negative values. This means the numbers that have a 1 in the leftmost bit are considered negative.
Nevertheless, these numbers are congruent modulo 256 (modulo 2^n if n bits) to their positive value as the number really suggests. For example the number 11111111 is 255 if unsigned, and -1 if signed which are congruent modulo 256.
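A tiny sketch of that last point (assuming the usual two's-complement representation; the conversion to signed char is implementation-defined before C23):

#include <stdio.h>

int main(void)
{
    unsigned char u = 0xFF;               /* bit pattern 1111 1111 */
    signed char   s = (signed char)0xFF;  /* same bits, signed view */

    /* The same 8-bit pattern reads as 255 unsigned and -1 signed
     * (two's complement); the two values are congruent modulo 256. */
    printf("unsigned: %d\n", u);  /* 255 */
    printf("signed:   %d\n", s);  /* -1 on two's-complement machines */
    return 0;
}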
The reference you read is correct. At least, for the usual C implementations where short is 16 bits - that's not actually fixed in the standard.
16 bits can hold 2^16 possible bit patterns, that's 65536 possibilities. Signed shorts are -32768 to 32767, unsigned shorts are 0 to 65535.
This is defined in <limits.h>, and is SHRT_MIN & SHRT_MAX.
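For example (a quick check of those constants; the exact values depend on your implementation's short width):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("SHRT_MIN  = %d\n", SHRT_MIN);             /* -32768 with 16-bit shorts */
    printf("SHRT_MAX  = %d\n", SHRT_MAX);             /*  32767 */
    printf("USHRT_MAX = %u\n", (unsigned)USHRT_MAX);  /*  65535 */
    return 0;
}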
Others have posted pretty good solutions for you, but I don't think they have followed your thinking and explained where you were wrong. I will try.
I can see that a short consumes 2 bytes. That means it's 16 bits,
Up to this point you are correct (though short is not guaranteed to be 2 bytes long, just as int is not guaranteed to be 4; the only size guaranteed by the standard, if I remember correctly, is char, which is always 1 byte wide).
which means two numbers since it takes 8 bits to store the binary representation of a number.
From here you started to drift a bit. It doesn't really take 8 bits to store a number. Depending on the number, it may take 16, 32, 64 or even more bits to store it. Dividing your 16 bits into two separate numbers is wrong. If it weren't for CPU implementation specifics, we could have, for example, 2-bit numbers. In that case, those two bits could store the following values:
00 - 0 in decimal
01 - 1 in decimal
10 - 2 in decimal
11 - 3 in decimal
To store 4, we need 3 bits, so the value would "not fit", causing an overflow. The same applies to a 16-bit number. For example, say we have unsigned "255" in decimal stored in 16 bits; the binary representation would be 0000000011111111. When you add 1 to that number, it becomes 0000000100000000 (256 in decimal). If you had only 8 bits, it would overflow and become 0, because the most significant bit would have been discarded.
Now, the maximum unsigned number you can store in 16 bits of memory is 1111111111111111, which is 65535 in decimal. In other words, for unsigned numbers, set all bits to 1 and that yields the maximum possible value.
For signed numbers, however, the most significant bit represents the sign: 0 for positive and 1 for negative. The most negative value is 1000000000000000, which is -32,768 in base 10. The rules for signed binary representation are well described here.
Hope it helps!
The formula for the number of values an unsigned binary type can represent:
2 ^ (sizeof(type) * 8)
so its range is 0 to 2 ^ (sizeof(type) * 8) - 1.
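A short sketch of that formula (using CHAR_BIT from <limits.h> instead of a hard-coded 8; it assumes the type is narrower than unsigned long long so the shift does not overflow):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* Number of distinct values of an unsigned type: 2^(bits in the type). */
    unsigned long long values = 1ULL << (sizeof(unsigned short) * CHAR_BIT);
    printf("unsigned short holds %llu values: 0 .. %llu\n",
           values, values - 1);
    return 0;
}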

'float' vs. 'double' precision

The code
#include <stdio.h>
int main(void)
{
    float x = 3.141592653589793238;
    double z = 3.141592653589793238;
    printf("x=%f\n", x);
    printf("z=%f\n", z);
    printf("x=%20.18f\n", x);
    printf("z=%20.18f\n", z);
    return 0;
}
will give you the output
x=3.141593
z=3.141593
x=3.141592741012573242
z=3.141592653589793116
where on the third line of output 741012573242 is garbage and on the fourth line 116 is garbage. Do doubles always have 16 significant figures while floats always have 7 significant figures? Why don't doubles have 14 significant figures?
Floating point numbers in C use IEEE 754 encoding.
This type of encoding uses a sign, a significand, and an exponent.
Because of this encoding, many numbers will have small changes to allow them to be stored.
Also, the number of significant digits can change slightly since it is a binary representation, not a decimal one.
Single precision (float) gives you 23 bits of significand, 8 bits of exponent, and 1 sign bit.
Double precision (double) gives you 52 bits of significand, 11 bits of exponent, and 1 sign bit.
Do doubles always have 16 significant figures while floats always have 7 significant figures?
No. Doubles always have 53 significant bits and floats always have 24 significant bits (except for denormals, infinities, and NaN values, but those are subjects for a different question). These are binary formats, and you can only speak clearly about the precision of their representations in terms of binary digits (bits).
This is analogous to the question of how many digits can be stored in a binary integer: an unsigned 32 bit integer can store integers with up to 32 bits, which doesn't precisely map to any number of decimal digits: all integers of up to 9 decimal digits can be stored, but a lot of 10-digit numbers can be stored as well.
Why don't doubles have 14 significant figures?
The encoding of a double uses 64 bits (1 bit for the sign, 11 bits for the exponent, 52 explicit significant bits and one implicit bit), which is double the number of bits used to represent a float (32 bits).
float: 23 bits of significand, 8 bits of exponent, and 1 sign bit.
double: 52 bits of significand, 11 bits of exponent, and 1 sign bit.
It's usually based on significant figures of both the exponent and significand in base 2, not base 10. From what I can tell in the C99 standard, there is no specified precision for floats and doubles (other than the fact that 1 and 1 + 1E-5 / 1 + 1E-7 are distinguishable [float and double respectively]). The number of significant figures is left to the implementer (as well as which base is used internally; in other words, an implementation could decide to base its precision on 18 digits in base 3). [1]
If you need to know these values, the constants FLT_RADIX and FLT_MANT_DIG (and DBL_MANT_DIG / LDBL_MANT_DIG) are defined in float.h.
The reason it's called a double is because the number of bytes used to store it is double that of a float (but this includes both the exponent and significand). The IEEE 754 standard (used by most compilers) allocates relatively more bits to the significand than to the exponent (23 vs. 8 for float and 52 vs. 11 for double), which is why the precision is more than doubled.
1: Section 5.2.4.2.2 ( http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf )
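Those constants can be printed directly if you want the exact values for your implementation (a sketch using the standard <float.h> macros):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Significand size in FLT_RADIX digits (including the implicit bit)
     * and the corresponding number of guaranteed decimal digits. */
    printf("FLT_RADIX    = %d\n", FLT_RADIX);
    printf("FLT_MANT_DIG = %d, FLT_DIG = %d\n", FLT_MANT_DIG, FLT_DIG);
    printf("DBL_MANT_DIG = %d, DBL_DIG = %d\n", DBL_MANT_DIG, DBL_DIG);
    return 0;
}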
A float has 23 bits of precision, and a double has 52.
It's not exactly double precision because of how IEEE 754 works, and because binary doesn't really translate well to decimal. Take a look at the standard if you're interested.
