Fraction to right of radix - Floating point conversion - c

When converting a number from base 10 to binary using the floating point bit model, what determines how many zeros you "zero pad" the fraction to the right of the radix?
Take for example -44.375
It was a question on a test in my systems programming course, and below is the answer the prof provided the class with... I posted this because most comments below seem to argue what my professor states in the answer and causing some confusion.
Answer: 1 1000 0100 0110 0011 0000 0000 0000 000
-- sign bit: 1
-- fixed point: -44.375 = 25 + 23 + 22 + 2-2 + 2-3
= 101100.011
= 1.01100011 * 2<sup>5</sup>
-- exponent: 5 + 127 = 132 = 1000 0100
-- fraction: 0110 0011 0000 0000 0000 000
Marking:
-- 1 mark for correct sign bit
-- 2 marks for correct fixed point representation
-- 2 marks for correct exponent (in binary)
-- 2 marks for correct fraction (padded with zeros)

Unless the float is very small, there is no left "zero pad" of the fraction.
The sample here is -1.63 (in hexadecimal) * power(2,5 (decimal)).
The exponent is adjusted until the leading digit is 1.
printf("%a\n", -44.375);
// -0x1.63p+5
[Edit]
Your prof wants to see "2 marks for correct fraction (padded with zeros)" as the number of bits in a float, so the significand in your example is
1.0110 0011 0000 0000 0000 000
The leading 1 is not stored explicitly in a typical float.
OP "what determines how many zeros you "zero pad" the fraction to the right of the radix?
A: IEEE 754 binary32 (a popular float implementation) has a 24 bit significand. A lead bit (usually 1) and a 23-bit fraction. Thus your "right" zero padding goes out to fill 23 places.

To determine the significand of an IEEE-754 32-bit binary floating-point value:
Figure out where the leading (most significant) 1 bit is. That is the starting point. Calculate 23 more bits. If there is anything left over, round it into last of the 24 bits (carrying as necessary).
Exception: If the leading bit is less than 2-126, use the 2-126 bit as the starting point, even though it is zero.
That gives the mathematical significand. To get the bits for the significand field, remove the first bit. (And, if the exception was used, set the encoded exponent to zero instead of the normal value.)
Another exception: If the leading bit, after rounding, is 2128 or greater, the conversion overflows. Set the result to infinity.

Related

Signed range equilance [duplicate]

I'm in a computer systems course and have been struggling, in part, with two's complement. I want to understand it, but everything I've read hasn't brought the picture together for me. I've read the Wikipedia article and various other articles, including my text book.
What is two's complement, how can we use it and how can it affect numbers during operations like casts (from signed to unsigned and vice versa), bit-wise operations and bit-shift operations?
Two's complement is a clever way of storing integers so that common math problems are very simple to implement.
To understand, you have to think of the numbers in binary.
It basically says,
for zero, use all 0's.
for positive integers, start counting up, with a maximum of 2(number of bits - 1)-1.
for negative integers, do exactly the same thing, but switch the role of 0's and 1's and count down (so instead of starting with 0000, start with 1111 - that's the "complement" part).
Let's try it with a mini-byte of 4 bits (we'll call it a nibble - 1/2 a byte).
0000 - zero
0001 - one
0010 - two
0011 - three
0100 to 0111 - four to seven
That's as far as we can go in positives. 23-1 = 7.
For negatives:
1111 - negative one
1110 - negative two
1101 - negative three
1100 to 1000 - negative four to negative eight
Note that you get one extra value for negatives (1000 = -8) that you don't for positives. This is because 0000 is used for zero. This can be considered as Number Line of computers.
Distinguishing between positive and negative numbers
Doing this, the first bit gets the role of the "sign" bit, as it can be used to distinguish between nonnegative and negative decimal values. If the most significant bit is 1, then the binary can be said to be negative, where as if the most significant bit (the leftmost) is 0, you can say the decimal value is nonnegative.
"Sign-magnitude" negative numbers just have the sign bit flipped of their positive counterparts, but this approach has to deal with interpreting 1000 (one 1 followed by all 0s) as "negative zero" which is confusing.
"Ones' complement" negative numbers are just the bit-complement of their positive counterparts, which also leads to a confusing "negative zero" with 1111 (all ones).
You will likely not have to deal with Ones' Complement or Sign-Magnitude integer representations unless you are working very close to the hardware.
I wonder if it could be explained any better than the Wikipedia article.
The basic problem that you are trying to solve with two's complement representation is the problem of storing negative integers.
First, consider an unsigned integer stored in 4 bits. You can have the following
0000 = 0
0001 = 1
0010 = 2
...
1111 = 15
These are unsigned because there is no indication of whether they are negative or positive.
Sign Magnitude and Excess Notation
To store negative numbers you can try a number of things. First, you can use sign magnitude notation which assigns the first bit as a sign bit to represent +/- and the remaining bits to represent the magnitude. So using 4 bits again and assuming that 1 means - and 0 means + then you have
0000 = +0
0001 = +1
0010 = +2
...
1000 = -0
1001 = -1
1111 = -7
So, you see the problem there? We have positive and negative 0. The bigger problem is adding and subtracting binary numbers. The circuits to add and subtract using sign magnitude will be very complex.
What is
0010
1001 +
----
?
Another system is excess notation. You can store negative numbers, you get rid of the two zeros problem but addition and subtraction remains difficult.
So along comes two's complement. Now you can store positive and negative integers and perform arithmetic with relative ease. There are a number of methods to convert a number into two's complement. Here's one.
Convert Decimal to Two's Complement
Convert the number to binary (ignore the sign for now)
e.g. 5 is 0101 and -5 is 0101
If the number is a positive number then you are done.
e.g. 5 is 0101 in binary using two's complement notation.
If the number is negative then
3.1 find the complement (invert 0's and 1's)
e.g. -5 is 0101 so finding the complement is 1010
3.2 Add 1 to the complement 1010 + 1 = 1011.
Therefore, -5 in two's complement is 1011.
So, what if you wanted to do 2 + (-3) in binary? 2 + (-3) is -1.
What would you have to do if you were using sign magnitude to add these numbers? 0010 + 1101 = ?
Using two's complement consider how easy it would be.
2 = 0010
-3 = 1101 +
-------------
-1 = 1111
Converting Two's Complement to Decimal
Converting 1111 to decimal:
The number starts with 1, so it's negative, so we find the complement of 1111, which is 0000.
Add 1 to 0000, and we obtain 0001.
Convert 0001 to decimal, which is 1.
Apply the sign = -1.
Tada!
Like most explanations I've seen, the ones above are clear about how to work with 2's complement, but don't really explain what they are mathematically. I'll try to do that, for integers at least, and I'll cover some background that's probably familiar first.
Recall how it works for decimal: 2345 is a way of writing 2 × 103 + 3 × 102 + 4 × 101 + 5 × 100.
In the same way, binary is a way of writing numbers using just 0 and 1 following the same general idea, but replacing those 10s above with 2s. Then in binary, 1111is a way of writing 1 × 23 + 1 × 22 + 1 × 21 + 1 × 20and if you work it out, that turns out to equal 15 (base 10). That's because it is 8+4+2+1 = 15.
This is all well and good for positive numbers. It even works for negative numbers if you're willing to just stick a minus sign in front of them, as humans do with decimal numbers. That can even be done in computers, sort of, but I haven't seen such a computer since the early 1970's. I'll leave the reasons for a different discussion.
For computers it turns out to be more efficient to use a complement representation for negative numbers. And here's something that is often overlooked. Complement notations involve some kind of reversal of the digits of the number, even the implied zeroes that come before a normal positive number. That's awkward, because the question arises: all of them? That could be an infinite number of digits to be considered.
Fortunately, computers don't represent infinities. Numbers are constrained to a particular length (or width, if you prefer). So let's return to positive binary numbers, but with a particular size. I'll use 8 digits ("bits") for these examples. So our binary number would really be 00001111or 0 × 27 + 0 × 26 + 0 × 25 + 0 × 24 + 1 × 23 + 1 × 22 + 1 × 21 + 1 × 20
To form the 2's complement negative, we first complement all the (binary) digits to form 11110000and add 1 to form 11110001but how are we to understand that to mean -15?
The answer is that we change the meaning of the high-order bit (the leftmost one). This bit will be a 1 for all negative numbers. The change will be to change the sign of its contribution to the value of the number it appears in. So now our 11110001 is understood to represent -1 × 27 + 1 × 26 + 1 × 25 + 1 × 24 + 0 × 23 + 0 × 22 + 0 × 21 + 1 × 20Notice that "-" in front of that expression? It means that the sign bit carries the weight -27, that is -128 (base 10). All the other positions retain the same weight they had in unsigned binary numbers.
Working out our -15, it is -128 + 64 + 32 + 16 + 1 Try it on your calculator. it's -15.
Of the three main ways that I've seen negative numbers represented in computers, 2's complement wins hands down for convenience in general use. It has an oddity, though. Since it's binary, there have to be an even number of possible bit combinations. Each positive number can be paired with its negative, but there's only one zero. Negating a zero gets you zero. So there's one more combination, the number with 1 in the sign bit and 0 everywhere else. The corresponding positive number would not fit in the number of bits being used.
What's even more odd about this number is that if you try to form its positive by complementing and adding one, you get the same negative number back. It seems natural that zero would do this, but this is unexpected and not at all the behavior we're used to because computers aside, we generally think of an unlimited supply of digits, not this fixed-length arithmetic.
This is like the tip of an iceberg of oddities. There's more lying in wait below the surface, but that's enough for this discussion. You could probably find more if you research "overflow" for fixed-point arithmetic. If you really want to get into it, you might also research "modular arithmetic".
2's complement is very useful for finding the value of a binary, however I thought of a much more concise way of solving such a problem(never seen anyone else publish it):
take a binary, for example: 1101 which is [assuming that space "1" is the sign] equal to -3.
using 2's complement we would do this...flip 1101 to 0010...add 0001 + 0010 ===> gives us 0011. 0011 in positive binary = 3. therefore 1101 = -3!
What I realized:
instead of all the flipping and adding, you can just do the basic method for solving for a positive binary(lets say 0101) is (23 * 0) + (22 * 1) + (21 * 0) + (20 * 1) = 5.
Do exactly the same concept with a negative!(with a small twist)
take 1101, for example:
for the first number instead of 23 * 1 = 8 , do -(23 * 1) = -8.
then continue as usual, doing -8 + (22 * 1) + (21 * 0) + (20 * 1) = -3
Imagine that you have a finite number of bits/trits/digits/whatever. You define 0 as all digits being 0, and count upwards naturally:
00
01
02
..
Eventually you will overflow.
98
99
00
We have two digits and can represent all numbers from 0 to 100. All those numbers are positive! Suppose we want to represent negative numbers too?
What we really have is a cycle. The number before 2 is 1. The number before 1 is 0. The number before 0 is... 99.
So, for simplicity, let's say that any number over 50 is negative. "0" through "49" represent 0 through 49. "99" is -1, "98" is -2, ... "50" is -50.
This representation is ten's complement. Computers typically use two's complement, which is the same except using bits instead of digits.
The nice thing about ten's complement is that addition just works. You do not need to do anything special to add positive and negative numbers!
I read a fantastic explanation on Reddit by jng, using the odometer as an analogy.
It is a useful convention. The same circuits and logic operations that
add / subtract positive numbers in binary still work on both positive
and negative numbers if using the convention, that's why it's so
useful and omnipresent.
Imagine the odometer of a car, it rolls around at (say) 99999. If you
increment 00000 you get 00001. If you decrement 00000, you get 99999
(due to the roll-around). If you add one back to 99999 it goes back to
00000. So it's useful to decide that 99999 represents -1. Likewise, it is very useful to decide that 99998 represents -2, and so on. You have
to stop somewhere, and also by convention, the top half of the numbers
are deemed to be negative (50000-99999), and the bottom half positive
just stand for themselves (00000-49999). As a result, the top digit
being 5-9 means the represented number is negative, and it being 0-4
means the represented is positive - exactly the same as the top bit
representing sign in a two's complement binary number.
Understanding this was hard for me too. Once I got it and went back to
re-read the books articles and explanations (there was no internet
back then), it turned out a lot of those describing it didn't really
understand it. I did write a book teaching assembly language after
that (which did sell quite well for 10 years).
Two complement is found out by adding one to 1'st complement of the given number.
Lets say we have to find out twos complement of 10101 then find its ones complement, that is, 01010 add 1 to this result, that is, 01010+1=01011, which is the final answer.
Lets get the answer 10 – 12 in binary form using 8 bits:
What we will really do is 10 + (-12)
We need to get the compliment part of 12 to subtract it from 10.
12 in binary is 00001100.
10 in binary is 00001010.
To get the compliment part of 12 we just reverse all the bits then add 1.
12 in binary reversed is 11110011. This is also the Inverse code (one's complement).
Now we need to add one, which is now 11110100.
So 11110100 is the compliment of 12! Easy when you think of it this way.
Now you can solve the above question of 10 - 12 in binary form.
00001010
11110100
-----------------
11111110
Looking at the two's complement system from a math point of view it really makes sense. In ten's complement, the idea is to essentially 'isolate' the difference.
Example: 63 - 24 = x
We add the complement of 24 which is really just (100 - 24). So really, all we are doing is adding 100 on both sides of the equation.
Now the equation is: 100 + 63 - 24 = x + 100, that is why we remove the 100 (or 10 or 1000 or whatever).
Due to the inconvenient situation of having to subtract one number from a long chain of zeroes, we use a 'diminished radix complement' system, in the decimal system, nine's complement.
When we are presented with a number subtracted from a big chain of nines, we just need to reverse the numbers.
Example: 99999 - 03275 = 96724
That is the reason, after nine's complement, we add 1. As you probably know from childhood math, 9 becomes 10 by 'stealing' 1. So basically it's just ten's complement that takes 1 from the difference.
In Binary, two's complement is equatable to ten's complement, while one's complement to nine's complement. The primary difference is that instead of trying to isolate the difference with powers of ten (adding 10, 100, etc. into the equation) we are trying to isolate the difference with powers of two.
It is for this reason that we invert the bits. Just like how our minuend is a chain of nines in decimal, our minuend is a chain of ones in binary.
Example: 111111 - 101001 = 010110
Because chains of ones are 1 below a nice power of two, they 'steal' 1 from the difference like nine's do in decimal.
When we are using negative binary number's, we are really just saying:
0000 - 0101 = x
1111 - 0101 = 1010
1111 + 0000 - 0101 = x + 1111
In order to 'isolate' x, we need to add 1 because 1111 is one away from 10000 and we remove the leading 1 because we just added it to the original difference.
1111 + 1 + 0000 - 0101 = x + 1111 + 1
10000 + 0000 - 0101 = x + 10000
Just remove 10000 from both sides to get x, it's basic algebra.
The word complement derives from completeness. In the decimal world the numerals 0 through 9 provide a complement (complete set) of numerals or numeric symbols to express all decimal numbers. In the binary world the numerals 0 and 1 provide a complement of numerals to express all binary numbers. In fact The symbols 0 and 1 must be used to represent everything (text, images, etc) as well as positive (0) and negative (1).
In our world the blank space to the left of number is considered as zero:
35=035=000000035.
In a computer storage location there is no blank space. All bits (binary digits) must be either 0 or 1. To efficiently use memory numbers may be stored as 8 bit, 16 bit, 32 bit, 64 bit, 128 bit representations. When a number that is stored as an 8 bit number is transferred to a 16 bit location the sign and magnitude (absolute value) must remain the same. Both 1's complement and 2's complement representations facilitate this.
As a noun:
Both 1's complement and 2's complement are binary representations of signed quantities where the most significant bit (the one on the left) is the sign bit. 0 is for positive and 1 is for negative.
2s complement does not mean negative. It means a signed quantity. As in decimal the magnitude is represented as the positive quantity. The structure uses sign extension to preserve the quantity when promoting to a register [] with more bits:
[0101]=[00101]=[00000000000101]=5 (base 10)
[1011]=[11011]=[11111111111011]=-5(base 10)
As a verb:
2's complement means to negate. It does not mean make negative. It means if negative make positive; if positive make negative. The magnitude is the absolute value:
if a >= 0 then |a| = a
if a < 0 then |a| = -a = 2scomplement of a
This ability allows efficient binary subtraction using negate then add.
a - b = a + (-b)
The official way to take the 1's complement is for each digit subtract its value from 1.
1'scomp(0101) = 1010.
This is the same as flipping or inverting each bit individually. This results in a negative zero which is not well loved so adding one to te 1's complement gets rid of the problem.
To negate or take the 2s complement first take the 1s complement then add 1.
Example 1 Example 2
0101 --original number 1101
1's comp 1010 0010
add 1 0001 0001
2's comp 1011 --negated number 0011
In the examples the negation works as well with sign extended numbers.
Adding:
1110 Carry 111110 Carry
0110 is the same as 000110
1111 111111
sum 0101 sum 000101
SUbtracting:
1110 Carry 00000 Carry
0110 is the same as 00110
-0111 +11001
---------- ----------
sum 0101 sum 11111
Notice that when working with 2's complement, blank space to the left of the number is filled with zeros for positive numbers butis filled with ones for negative numbers. The carry is always added and must be either a 1 or 0.
Cheers
2's complement is essentially a way of coming up with the additive inverse of a binary number. Ask yourself this: Given a number in binary form (present at a fixed length memory location), what bit pattern, when added to the original number (at the fixed length memory location), would make the result all zeros ? (at the same fixed length memory location). If we could come up with this bit pattern then that bit pattern would be the -ve representation (additive inverse) of the original number; as by definition adding a number to its additive inverse always results in zero. Example: take 5 which is 101 present inside a single 8 bit byte. Now the task is to come up with a bit pattern which when added to the given bit pattern (00000101) would result in all zeros at the memory location which is used to hold this 5 i.e. all 8 bits of the byte should be zero. To do that, start from the right most bit of 101 and for each individual bit, again ask the same question: What bit should I add to the current bit to make the result zero ? continue doing that taking in account the usual carry over. After we are done with the 3 right most places (the digits that define the original number without regard to the leading zeros) the last carry goes in the bit pattern of the additive inverse. Furthermore, since we are holding in the original number in a single 8 bit byte, all other leading bits in the additive inverse should also be 1's so that (and this is important) when the computer adds "the number" (represented using the 8 bit pattern) and its additive inverse using "that" storage type (a byte) the result in that byte would be all zeros.
1 1 1
----------
1 0 1
1 0 1 1 ---> additive inverse
---------
0 0 0
Many of the answers so far nicely explain why two's complement is used to represent negative numbers, but do not tell us what two's complement number is, particularly not why a '1' is added, and in fact often added in a wrong way.
The confusion comes from a poor understanding of the definition of a complement number. A complement is the missing part that would make something complete.
The radix complement of an n digit number x in radix b is, by definition, b^n-x.
In binary 4 is represented by 100, which has 3 digits (n=3) and a radix of 2 (b=2). So its radix complement is b^n-x = 2^3-4=8-4=4 (or 100 in binary).
However, in binary obtaining a radix's complement is not as easy as getting its diminished radix complement, which is defined as (b^n-1)-y, just 1 less than that of radix complement. To get a diminished radix complement, you simply flip all the digits.
100 -> 011 (diminished (one's) radix complement)
to obtain the radix (two's) complement, we simply add 1, as the definition defined.
011 +1 ->100 (two's complement).
Now with this new understanding, let's take a look of the example given by Vincent Ramdhanie (see above second response):
Converting 1111 to decimal:
The number starts with 1, so it's negative, so we find the complement of 1111, which is 0000.
Add 1 to 0000, and we obtain 0001.
Convert 0001 to decimal, which is 1.
Apply the sign = -1.
Tada!
Should be understood as:
The number starts with 1, so it's negative. So we know it is a two's complement of some value x. To find the x represented by its two's complement, we first need find its 1's complement.
two's complement of x: 1111
one's complement of x: 1111-1 ->1110;
x = 0001, (flip all digits)
Apply the sign -, and the answer =-x =-1.
I liked lavinio's answer, but shifting bits adds some complexity. Often there's a choice of moving bits while respecting the sign bit or while not respecting the sign bit. This is the choice between treating the numbers as signed (-8 to 7 for a nibble, -128 to 127 for bytes) or full-range unsigned numbers (0 to 15 for nibbles, 0 to 255 for bytes).
It is a clever means of encoding negative integers in such a way that approximately half of the combination of bits of a data type are reserved for negative integers, and the addition of most of the negative integers with their corresponding positive integers results in a carry overflow that leaves the result to be binary zero.
So, in 2's complement if one is 0x0001 then -1 is 0x1111, because that will result in a combined sum of 0x0000 (with an overflow of 1).
2’s Complements: When we add an extra one with the 1’s complements of a number we will get the 2’s complements. For example: 100101 it’s 1’s complement is 011010 and 2’s complement is 011010+1 = 011011 (By adding one with 1's complement) For more information
this article explain it graphically.
Two's complement is mainly used for the following reasons:
To avoid multiple representations of 0
To avoid keeping track of carry bit (as in one's complement) in case of an overflow.
Carrying out simple operations like addition and subtraction becomes easy.
Two's complement is one of the ways of expressing a negative number and most of the controllers and processors store a negative number in two's complement form.
In simple terms, two's complement is a way to store negative numbers in computer memory. Whereas positive numbers are stored as a normal binary number.
Let's consider this example,
The computer uses the binary number system to represent any number.
x = 5;
This is represented as 0101.
x = -5;
When the computer encounters the - sign, it computes its two's complement and stores it.
That is, 5 = 0101 and its two's complement is 1011.
The important rules the computer uses to process numbers are,
If the first bit is 1 then it must be a negative number.
If all the bits except first bit are 0 then it is a positive number, because there is no -0 in number system (1000 is not -0 instead it is positive 8).
If all the bits are 0 then it is 0.
Else it is a positive number.
To bitwise complement a number is to flip all the bits in it. To two’s complement it, we flip all the bits and add one.
Using 2’s complement representation for signed integers, we apply the 2’s complement operation to convert a positive number to its negative equivalent and vice versa. So using nibbles for an example, 0001 (1) becomes 1111 (-1) and applying the op again, returns to 0001.
The behaviour of the operation at zero is advantageous in giving a single representation for zero without special handling of positive and negative zeroes. 0000 complements to 1111, which when 1 is added. overflows to 0000, giving us one zero, rather than a positive and a negative one.
A key advantage of this representation is that the standard addition circuits for unsigned integers produce correct results when applied to them. For example adding 1 and -1 in nibbles: 0001 + 1111, the bits overflow out of the register, leaving behind 0000.
For a gentle introduction, the wonderful Computerphile have produced a video on the subject.
The question is 'What is “two's complement”?'
The simple answer for those wanting to understand it theoretically (and me seeking to complement the other more practical answers): 2's complement is the representation for negative integers in the dual system that does not require additional characters, such as + and -.
Two's complement of a given number is the number got by adding 1 with the ones' complement of the number.
Suppose, we have a binary number: 10111001101
Its 1's complement is: 01000110010
And its two's complement will be: 01000110011
Reference: Two's Complement (Thomas Finley)
I invert all the bits and add 1. Programmatically:
// In C++11
int _powers[] = {
1,
2,
4,
8,
16,
32,
64,
128
};
int value = 3;
int n_bits = 4;
int twos_complement = (value ^ ( _powers[n_bits]-1)) + 1;
You can also use an online calculator to calculate the two's complement binary representation of a decimal number: http://www.convertforfree.com/twos-complement-calculator/
The simplest answer:
1111 + 1 = (1)0000. So 1111 must be -1. Then -1 + 1 = 0.
It's perfect to understand these all for me.

-2^31 as smallest integer, why? [duplicate]

I'm in a computer systems course and have been struggling, in part, with two's complement. I want to understand it, but everything I've read hasn't brought the picture together for me. I've read the Wikipedia article and various other articles, including my text book.
What is two's complement, how can we use it and how can it affect numbers during operations like casts (from signed to unsigned and vice versa), bit-wise operations and bit-shift operations?
Two's complement is a clever way of storing integers so that common math problems are very simple to implement.
To understand, you have to think of the numbers in binary.
It basically says,
for zero, use all 0's.
for positive integers, start counting up, with a maximum of 2(number of bits - 1)-1.
for negative integers, do exactly the same thing, but switch the role of 0's and 1's and count down (so instead of starting with 0000, start with 1111 - that's the "complement" part).
Let's try it with a mini-byte of 4 bits (we'll call it a nibble - 1/2 a byte).
0000 - zero
0001 - one
0010 - two
0011 - three
0100 to 0111 - four to seven
That's as far as we can go in positives. 23-1 = 7.
For negatives:
1111 - negative one
1110 - negative two
1101 - negative three
1100 to 1000 - negative four to negative eight
Note that you get one extra value for negatives (1000 = -8) that you don't for positives. This is because 0000 is used for zero. This can be considered as Number Line of computers.
Distinguishing between positive and negative numbers
Doing this, the first bit gets the role of the "sign" bit, as it can be used to distinguish between nonnegative and negative decimal values. If the most significant bit is 1, then the binary can be said to be negative, where as if the most significant bit (the leftmost) is 0, you can say the decimal value is nonnegative.
"Sign-magnitude" negative numbers just have the sign bit flipped of their positive counterparts, but this approach has to deal with interpreting 1000 (one 1 followed by all 0s) as "negative zero" which is confusing.
"Ones' complement" negative numbers are just the bit-complement of their positive counterparts, which also leads to a confusing "negative zero" with 1111 (all ones).
You will likely not have to deal with Ones' Complement or Sign-Magnitude integer representations unless you are working very close to the hardware.
I wonder if it could be explained any better than the Wikipedia article.
The basic problem that you are trying to solve with two's complement representation is the problem of storing negative integers.
First, consider an unsigned integer stored in 4 bits. You can have the following
0000 = 0
0001 = 1
0010 = 2
...
1111 = 15
These are unsigned because there is no indication of whether they are negative or positive.
Sign Magnitude and Excess Notation
To store negative numbers you can try a number of things. First, you can use sign magnitude notation which assigns the first bit as a sign bit to represent +/- and the remaining bits to represent the magnitude. So using 4 bits again and assuming that 1 means - and 0 means + then you have
0000 = +0
0001 = +1
0010 = +2
...
1000 = -0
1001 = -1
1111 = -7
So, you see the problem there? We have positive and negative 0. The bigger problem is adding and subtracting binary numbers. The circuits to add and subtract using sign magnitude will be very complex.
What is
0010
1001 +
----
?
Another system is excess notation. You can store negative numbers, you get rid of the two zeros problem but addition and subtraction remains difficult.
So along comes two's complement. Now you can store positive and negative integers and perform arithmetic with relative ease. There are a number of methods to convert a number into two's complement. Here's one.
Convert Decimal to Two's Complement
Convert the number to binary (ignore the sign for now)
e.g. 5 is 0101 and -5 is 0101
If the number is a positive number then you are done.
e.g. 5 is 0101 in binary using two's complement notation.
If the number is negative then
3.1 find the complement (invert 0's and 1's)
e.g. -5 is 0101 so finding the complement is 1010
3.2 Add 1 to the complement 1010 + 1 = 1011.
Therefore, -5 in two's complement is 1011.
So, what if you wanted to do 2 + (-3) in binary? 2 + (-3) is -1.
What would you have to do if you were using sign magnitude to add these numbers? 0010 + 1101 = ?
Using two's complement consider how easy it would be.
2 = 0010
-3 = 1101 +
-------------
-1 = 1111
Converting Two's Complement to Decimal
Converting 1111 to decimal:
The number starts with 1, so it's negative, so we find the complement of 1111, which is 0000.
Add 1 to 0000, and we obtain 0001.
Convert 0001 to decimal, which is 1.
Apply the sign = -1.
Tada!
Like most explanations I've seen, the ones above are clear about how to work with 2's complement, but don't really explain what they are mathematically. I'll try to do that, for integers at least, and I'll cover some background that's probably familiar first.
Recall how it works for decimal: 2345 is a way of writing 2 × 103 + 3 × 102 + 4 × 101 + 5 × 100.
In the same way, binary is a way of writing numbers using just 0 and 1 following the same general idea, but replacing those 10s above with 2s. Then in binary, 1111is a way of writing 1 × 23 + 1 × 22 + 1 × 21 + 1 × 20and if you work it out, that turns out to equal 15 (base 10). That's because it is 8+4+2+1 = 15.
This is all well and good for positive numbers. It even works for negative numbers if you're willing to just stick a minus sign in front of them, as humans do with decimal numbers. That can even be done in computers, sort of, but I haven't seen such a computer since the early 1970's. I'll leave the reasons for a different discussion.
For computers it turns out to be more efficient to use a complement representation for negative numbers. And here's something that is often overlooked. Complement notations involve some kind of reversal of the digits of the number, even the implied zeroes that come before a normal positive number. That's awkward, because the question arises: all of them? That could be an infinite number of digits to be considered.
Fortunately, computers don't represent infinities. Numbers are constrained to a particular length (or width, if you prefer). So let's return to positive binary numbers, but with a particular size. I'll use 8 digits ("bits") for these examples. So our binary number would really be 00001111or 0 × 27 + 0 × 26 + 0 × 25 + 0 × 24 + 1 × 23 + 1 × 22 + 1 × 21 + 1 × 20
To form the 2's complement negative, we first complement all the (binary) digits to form 11110000and add 1 to form 11110001but how are we to understand that to mean -15?
The answer is that we change the meaning of the high-order bit (the leftmost one). This bit will be a 1 for all negative numbers. The change will be to change the sign of its contribution to the value of the number it appears in. So now our 11110001 is understood to represent -1 × 27 + 1 × 26 + 1 × 25 + 1 × 24 + 0 × 23 + 0 × 22 + 0 × 21 + 1 × 20Notice that "-" in front of that expression? It means that the sign bit carries the weight -27, that is -128 (base 10). All the other positions retain the same weight they had in unsigned binary numbers.
Working out our -15, it is -128 + 64 + 32 + 16 + 1 Try it on your calculator. it's -15.
Of the three main ways that I've seen negative numbers represented in computers, 2's complement wins hands down for convenience in general use. It has an oddity, though. Since it's binary, there have to be an even number of possible bit combinations. Each positive number can be paired with its negative, but there's only one zero. Negating a zero gets you zero. So there's one more combination, the number with 1 in the sign bit and 0 everywhere else. The corresponding positive number would not fit in the number of bits being used.
What's even more odd about this number is that if you try to form its positive by complementing and adding one, you get the same negative number back. It seems natural that zero would do this, but this is unexpected and not at all the behavior we're used to because computers aside, we generally think of an unlimited supply of digits, not this fixed-length arithmetic.
This is like the tip of an iceberg of oddities. There's more lying in wait below the surface, but that's enough for this discussion. You could probably find more if you research "overflow" for fixed-point arithmetic. If you really want to get into it, you might also research "modular arithmetic".
2's complement is very useful for finding the value of a binary, however I thought of a much more concise way of solving such a problem(never seen anyone else publish it):
take a binary, for example: 1101 which is [assuming that space "1" is the sign] equal to -3.
using 2's complement we would do this...flip 1101 to 0010...add 0001 + 0010 ===> gives us 0011. 0011 in positive binary = 3. therefore 1101 = -3!
What I realized:
instead of all the flipping and adding, you can just do the basic method for solving for a positive binary(lets say 0101) is (23 * 0) + (22 * 1) + (21 * 0) + (20 * 1) = 5.
Do exactly the same concept with a negative!(with a small twist)
take 1101, for example:
for the first number instead of 23 * 1 = 8 , do -(23 * 1) = -8.
then continue as usual, doing -8 + (22 * 1) + (21 * 0) + (20 * 1) = -3
Imagine that you have a finite number of bits/trits/digits/whatever. You define 0 as all digits being 0, and count upwards naturally:
00
01
02
..
Eventually you will overflow.
98
99
00
We have two digits and can represent all numbers from 0 to 100. All those numbers are positive! Suppose we want to represent negative numbers too?
What we really have is a cycle. The number before 2 is 1. The number before 1 is 0. The number before 0 is... 99.
So, for simplicity, let's say that any number over 50 is negative. "0" through "49" represent 0 through 49. "99" is -1, "98" is -2, ... "50" is -50.
This representation is ten's complement. Computers typically use two's complement, which is the same except using bits instead of digits.
The nice thing about ten's complement is that addition just works. You do not need to do anything special to add positive and negative numbers!
I read a fantastic explanation on Reddit by jng, using the odometer as an analogy.
It is a useful convention. The same circuits and logic operations that
add / subtract positive numbers in binary still work on both positive
and negative numbers if using the convention, that's why it's so
useful and omnipresent.
Imagine the odometer of a car, it rolls around at (say) 99999. If you
increment 00000 you get 00001. If you decrement 00000, you get 99999
(due to the roll-around). If you add one back to 99999 it goes back to
00000. So it's useful to decide that 99999 represents -1. Likewise, it is very useful to decide that 99998 represents -2, and so on. You have
to stop somewhere, and also by convention, the top half of the numbers
are deemed to be negative (50000-99999), and the bottom half positive
just stand for themselves (00000-49999). As a result, the top digit
being 5-9 means the represented number is negative, and it being 0-4
means the represented is positive - exactly the same as the top bit
representing sign in a two's complement binary number.
Understanding this was hard for me too. Once I got it and went back to
re-read the books articles and explanations (there was no internet
back then), it turned out a lot of those describing it didn't really
understand it. I did write a book teaching assembly language after
that (which did sell quite well for 10 years).
Two complement is found out by adding one to 1'st complement of the given number.
Lets say we have to find out twos complement of 10101 then find its ones complement, that is, 01010 add 1 to this result, that is, 01010+1=01011, which is the final answer.
Lets get the answer 10 – 12 in binary form using 8 bits:
What we will really do is 10 + (-12)
We need to get the compliment part of 12 to subtract it from 10.
12 in binary is 00001100.
10 in binary is 00001010.
To get the compliment part of 12 we just reverse all the bits then add 1.
12 in binary reversed is 11110011. This is also the Inverse code (one's complement).
Now we need to add one, which is now 11110100.
So 11110100 is the compliment of 12! Easy when you think of it this way.
Now you can solve the above question of 10 - 12 in binary form.
00001010
11110100
-----------------
11111110
Looking at the two's complement system from a math point of view it really makes sense. In ten's complement, the idea is to essentially 'isolate' the difference.
Example: 63 - 24 = x
We add the complement of 24 which is really just (100 - 24). So really, all we are doing is adding 100 on both sides of the equation.
Now the equation is: 100 + 63 - 24 = x + 100, that is why we remove the 100 (or 10 or 1000 or whatever).
Due to the inconvenient situation of having to subtract one number from a long chain of zeroes, we use a 'diminished radix complement' system, in the decimal system, nine's complement.
When we are presented with a number subtracted from a big chain of nines, we just need to reverse the numbers.
Example: 99999 - 03275 = 96724
That is the reason, after nine's complement, we add 1. As you probably know from childhood math, 9 becomes 10 by 'stealing' 1. So basically it's just ten's complement that takes 1 from the difference.
In Binary, two's complement is equatable to ten's complement, while one's complement to nine's complement. The primary difference is that instead of trying to isolate the difference with powers of ten (adding 10, 100, etc. into the equation) we are trying to isolate the difference with powers of two.
It is for this reason that we invert the bits. Just like how our minuend is a chain of nines in decimal, our minuend is a chain of ones in binary.
Example: 111111 - 101001 = 010110
Because chains of ones are 1 below a nice power of two, they 'steal' 1 from the difference like nine's do in decimal.
When we are using negative binary number's, we are really just saying:
0000 - 0101 = x
1111 - 0101 = 1010
1111 + 0000 - 0101 = x + 1111
In order to 'isolate' x, we need to add 1 because 1111 is one away from 10000 and we remove the leading 1 because we just added it to the original difference.
1111 + 1 + 0000 - 0101 = x + 1111 + 1
10000 + 0000 - 0101 = x + 10000
Just remove 10000 from both sides to get x, it's basic algebra.
The word complement derives from completeness. In the decimal world the numerals 0 through 9 provide a complement (complete set) of numerals or numeric symbols to express all decimal numbers. In the binary world the numerals 0 and 1 provide a complement of numerals to express all binary numbers. In fact The symbols 0 and 1 must be used to represent everything (text, images, etc) as well as positive (0) and negative (1).
In our world the blank space to the left of number is considered as zero:
35=035=000000035.
In a computer storage location there is no blank space. All bits (binary digits) must be either 0 or 1. To efficiently use memory numbers may be stored as 8 bit, 16 bit, 32 bit, 64 bit, 128 bit representations. When a number that is stored as an 8 bit number is transferred to a 16 bit location the sign and magnitude (absolute value) must remain the same. Both 1's complement and 2's complement representations facilitate this.
As a noun:
Both 1's complement and 2's complement are binary representations of signed quantities where the most significant bit (the one on the left) is the sign bit. 0 is for positive and 1 is for negative.
2s complement does not mean negative. It means a signed quantity. As in decimal the magnitude is represented as the positive quantity. The structure uses sign extension to preserve the quantity when promoting to a register [] with more bits:
[0101]=[00101]=[00000000000101]=5 (base 10)
[1011]=[11011]=[11111111111011]=-5(base 10)
As a verb:
2's complement means to negate. It does not mean make negative. It means if negative make positive; if positive make negative. The magnitude is the absolute value:
if a >= 0 then |a| = a
if a < 0 then |a| = -a = 2scomplement of a
This ability allows efficient binary subtraction using negate then add.
a - b = a + (-b)
The official way to take the 1's complement is for each digit subtract its value from 1.
1'scomp(0101) = 1010.
This is the same as flipping or inverting each bit individually. This results in a negative zero which is not well loved so adding one to te 1's complement gets rid of the problem.
To negate or take the 2s complement first take the 1s complement then add 1.
Example 1 Example 2
0101 --original number 1101
1's comp 1010 0010
add 1 0001 0001
2's comp 1011 --negated number 0011
In the examples the negation works as well with sign extended numbers.
Adding:
1110 Carry 111110 Carry
0110 is the same as 000110
1111 111111
sum 0101 sum 000101
SUbtracting:
1110 Carry 00000 Carry
0110 is the same as 00110
-0111 +11001
---------- ----------
sum 0101 sum 11111
Notice that when working with 2's complement, blank space to the left of the number is filled with zeros for positive numbers butis filled with ones for negative numbers. The carry is always added and must be either a 1 or 0.
Cheers
2's complement is essentially a way of coming up with the additive inverse of a binary number. Ask yourself this: Given a number in binary form (present at a fixed length memory location), what bit pattern, when added to the original number (at the fixed length memory location), would make the result all zeros ? (at the same fixed length memory location). If we could come up with this bit pattern then that bit pattern would be the -ve representation (additive inverse) of the original number; as by definition adding a number to its additive inverse always results in zero. Example: take 5 which is 101 present inside a single 8 bit byte. Now the task is to come up with a bit pattern which when added to the given bit pattern (00000101) would result in all zeros at the memory location which is used to hold this 5 i.e. all 8 bits of the byte should be zero. To do that, start from the right most bit of 101 and for each individual bit, again ask the same question: What bit should I add to the current bit to make the result zero ? continue doing that taking in account the usual carry over. After we are done with the 3 right most places (the digits that define the original number without regard to the leading zeros) the last carry goes in the bit pattern of the additive inverse. Furthermore, since we are holding in the original number in a single 8 bit byte, all other leading bits in the additive inverse should also be 1's so that (and this is important) when the computer adds "the number" (represented using the 8 bit pattern) and its additive inverse using "that" storage type (a byte) the result in that byte would be all zeros.
1 1 1
----------
1 0 1
1 0 1 1 ---> additive inverse
---------
0 0 0
Many of the answers so far nicely explain why two's complement is used to represent negative numbers, but do not tell us what two's complement number is, particularly not why a '1' is added, and in fact often added in a wrong way.
The confusion comes from a poor understanding of the definition of a complement number. A complement is the missing part that would make something complete.
The radix complement of an n digit number x in radix b is, by definition, b^n-x.
In binary 4 is represented by 100, which has 3 digits (n=3) and a radix of 2 (b=2). So its radix complement is b^n-x = 2^3-4=8-4=4 (or 100 in binary).
However, in binary obtaining a radix's complement is not as easy as getting its diminished radix complement, which is defined as (b^n-1)-y, just 1 less than that of radix complement. To get a diminished radix complement, you simply flip all the digits.
100 -> 011 (diminished (one's) radix complement)
to obtain the radix (two's) complement, we simply add 1, as the definition defined.
011 +1 ->100 (two's complement).
Now with this new understanding, let's take a look of the example given by Vincent Ramdhanie (see above second response):
Converting 1111 to decimal:
The number starts with 1, so it's negative, so we find the complement of 1111, which is 0000.
Add 1 to 0000, and we obtain 0001.
Convert 0001 to decimal, which is 1.
Apply the sign = -1.
Tada!
Should be understood as:
The number starts with 1, so it's negative. So we know it is a two's complement of some value x. To find the x represented by its two's complement, we first need find its 1's complement.
two's complement of x: 1111
one's complement of x: 1111-1 ->1110;
x = 0001, (flip all digits)
Apply the sign -, and the answer =-x =-1.
I liked lavinio's answer, but shifting bits adds some complexity. Often there's a choice of moving bits while respecting the sign bit or while not respecting the sign bit. This is the choice between treating the numbers as signed (-8 to 7 for a nibble, -128 to 127 for bytes) or full-range unsigned numbers (0 to 15 for nibbles, 0 to 255 for bytes).
It is a clever means of encoding negative integers in such a way that approximately half of the combination of bits of a data type are reserved for negative integers, and the addition of most of the negative integers with their corresponding positive integers results in a carry overflow that leaves the result to be binary zero.
So, in 2's complement if one is 0x0001 then -1 is 0x1111, because that will result in a combined sum of 0x0000 (with an overflow of 1).
2’s Complements: When we add an extra one with the 1’s complements of a number we will get the 2’s complements. For example: 100101 it’s 1’s complement is 011010 and 2’s complement is 011010+1 = 011011 (By adding one with 1's complement) For more information
this article explain it graphically.
Two's complement is mainly used for the following reasons:
To avoid multiple representations of 0
To avoid keeping track of carry bit (as in one's complement) in case of an overflow.
Carrying out simple operations like addition and subtraction becomes easy.
Two's complement is one of the ways of expressing a negative number and most of the controllers and processors store a negative number in two's complement form.
In simple terms, two's complement is a way to store negative numbers in computer memory. Whereas positive numbers are stored as a normal binary number.
Let's consider this example,
The computer uses the binary number system to represent any number.
x = 5;
This is represented as 0101.
x = -5;
When the computer encounters the - sign, it computes its two's complement and stores it.
That is, 5 = 0101 and its two's complement is 1011.
The important rules the computer uses to process numbers are,
If the first bit is 1 then it must be a negative number.
If all the bits except first bit are 0 then it is a positive number, because there is no -0 in number system (1000 is not -0 instead it is positive 8).
If all the bits are 0 then it is 0.
Else it is a positive number.
To bitwise complement a number is to flip all the bits in it. To two’s complement it, we flip all the bits and add one.
Using 2’s complement representation for signed integers, we apply the 2’s complement operation to convert a positive number to its negative equivalent and vice versa. So using nibbles for an example, 0001 (1) becomes 1111 (-1) and applying the op again, returns to 0001.
The behaviour of the operation at zero is advantageous in giving a single representation for zero without special handling of positive and negative zeroes. 0000 complements to 1111, which when 1 is added. overflows to 0000, giving us one zero, rather than a positive and a negative one.
A key advantage of this representation is that the standard addition circuits for unsigned integers produce correct results when applied to them. For example adding 1 and -1 in nibbles: 0001 + 1111, the bits overflow out of the register, leaving behind 0000.
For a gentle introduction, the wonderful Computerphile have produced a video on the subject.
The question is 'What is “two's complement”?'
The simple answer for those wanting to understand it theoretically (and me seeking to complement the other more practical answers): 2's complement is the representation for negative integers in the dual system that does not require additional characters, such as + and -.
Two's complement of a given number is the number got by adding 1 with the ones' complement of the number.
Suppose, we have a binary number: 10111001101
Its 1's complement is: 01000110010
And its two's complement will be: 01000110011
Reference: Two's Complement (Thomas Finley)
I invert all the bits and add 1. Programmatically:
// In C++11
int _powers[] = {
1,
2,
4,
8,
16,
32,
64,
128
};
int value = 3;
int n_bits = 4;
int twos_complement = (value ^ ( _powers[n_bits]-1)) + 1;
You can also use an online calculator to calculate the two's complement binary representation of a decimal number: http://www.convertforfree.com/twos-complement-calculator/
The simplest answer:
1111 + 1 = (1)0000. So 1111 must be -1. Then -1 + 1 = 0.
It's perfect to understand these all for me.

Struggling with union output

I'm trying to figure out this code for about an hour and still no luck.
#include <stdio.h>
#include <stdlib.h>
int f(float f)
{
union un {float f; int i;} u = {f};
return (u.i&0x7F800000) >> 23;
}
int main()
{
printf("%d\n", f(1));
return 0;
}
I don't understand how this work, I've tried f(1), f(2), f(3), f(4) and of course getting the different results. I've also read a lot about unions and stuff. What I have noticed that when i delete 0x7F800000 from return, results will be the same. I wanna know how u.i is generated, obviously it is not some random garbage but also it is not one (1) from function argument. What is going on here, how does it work?
This really amounts to an understanding of how floating point numbers are represented in memory. (see IEEE 754).
In short, a 32-bit floating point number will have the following structure
bit 31 will be the sign bit for the overall number
bits 30 - 23 will be exponent for the number, biased 127
bits 22 - 0 will represent the fractional part of the number. This is normalized such that the digit before the decimal (actually binary) point is one.
With regards to the union, recall that a union is a block of computer memory that can hold one of the types at at time, so the declaration:
union un
{
float f;
int i;
};
is creating a 32-bit block of memory that can either hold a floating point number or an integer, at any given time. Now when we call the function with a floating point parameter, the bit-pattern of that number is written to the memory location of un. Now when we access the union using the i member, the bit pattern is treated as an integer.
Thus, the general layout of a 32-bit floating point number is seee eeee efff ffff ffff ffff ffff ffff, whese s represents the sign bit, e the exponent bits and f the fraction bits. OK, kind of gibberish, hopefully an example might help.
To convert 4 into IEEE floating point, first convert 7 into binary (I've split te 32-bit number into 4-bit nibbles);
4 = 0000 0000 0000 0000 0000 0000 0000 0111
Now we need to normalize this, i.e. express this as a number raised to the power of two;
1.11 x 2^2
Here we need to remember that each power of two move the binary point to the right on place (analogous to dealing with powers of 10).
From this, we now can generate the bit pattern
the overall sign of the number is positive, so the overall sign bit is 0.
the exponent is 2, but we bias the exponent with 127. This means that an exponent of -127 would be stored a 0, while an exponent of 127 would be stored as 255. Thus our exponent field would be 129 or 1000 0001.
Finally our normalized number would be 1100 0000 0000 0000 0000 000 000. Notice we have dropped the leading `1' because it always assumed to be there.
Putting this all together, we have as the bit pattern:
4 = 0100 0000 1110 0000 0000 0000 0000 0000
Now, the last little bit here is the bit-wise and with 0x7F800000 which if we
write out in binary is 0111 1111 1000 0000 0000 0000 0000 0000, If we compare this to the general lay out of an IEEE floating point number, we see that what we are selecting with the mask is the exponent bits, and then we are shifting the to the left 23 bits.
So your program is just printing out the biased exponent of a floating point number. As an example,
#include <stdio.h>
#include <stdlib.h>
int f(float f)
{
union un {float f; int i;} u = {f};
return (u.i&0x7F800000) >> 23;
}
int main()
{
printf("%d\n", f(7));
return 0;
}
gives an output of 129 as we would expect.

Converting IEEE 754 Float to MIL-STD-1750A Float

I am trying to convert a IEEE 754 32 bit single precision floating point value (standard c float variable) to an unsigned long variable in the format of MIL-STD-1750A. I have included the specification for both IEEE 754 and MIL-STD-1750A at the bottom of the post. Right now, I am having issues in my code with converting the exponent. I also see issues with converting the mantissa, but I haven't gotten to fixing those yet. I am using the examples listed in Table 3 in the link above to confirm if my program is converting properly. Some of those examples do not make sense to me.
How can these two examples have the same exponent?
.5 x 2^0 (0100 0000 0000 0000 0000 0000 0000 0000)
-1 x 2^0 (1000 0000 0000 0000 0000 0000 0000 0000)
.5 x 2^0 has one decimal place, and -1 has no decimal places, so the value for .5 x 2^0 should be
.5 x 2^0 (0100 0000 0000 0000 0000 0000 0000 0010)
right? (0010 instead of 0001, because 1750A uses plus 1 bias)
How can the last example use all 32 bits and the first bit be 1, indicating a negative value?
0.7500001x2^4 (1001 1111 1111 1111 1111 1111 0000 0100)
I can see that a value with a 127 exponent should be 7F (0111 1111) but what about a value with a negative 127 exponent? Would it be 81 (1000 0001)? If so, is it because that is the two's complement +1 of 127?
Thank you
1) How can these two examples have the same exponent?
As I understand it, the sign and mantissa effectively define a 2's-complement value in the range [-1.0,1.0).
Of course, this leads to redundant representations (0.125*21 = 0.25*20, etc.) So a canonical normalized representation is chosen, by disallowing mantissa values in the range [-0.5,0.5).
So in your two examples, both -1.0 and 0.5 fall into the "allowed" mantissa range, so they both share the same exponent value.
2) How can the last example use all 32 bits and the first bit be 1, indicating a negative value?
That doesn't look right to me; how did you obtain that representation?
3) What about a value with a negative 127 exponent? Would it be 81 (1000 0001)?
I believe so.
Remember the fraction is a "signed fraction". The signed values are stored in 2's complement format. So think of the zeros as ones.
Thus the number can be written as -0.111111111111111111111 (base 2) x 2^0
, which is close to one (converges to 1.0 if my math is correct)
On the last example, there is a negative sign in the original document (-0.7500001x2^4)

IEEE 754: How exactly does it work?

Why does the following code behave as it does in C?
float x = 2147483647; //2^31
printf("%f\n", x); //Outputs 2147483648
Here is my thought process:
2147483647 = 0 1001 1101 1111 1111 1111 1111 1111 111
(0.11111111111111111111111)base2 = (1-(0.5)^23)base10
=> (1.11111111111111111111111)base2 = (1 + 1-(0.5)^23)base10 = (1.99999988)base10
Therefore, to convert the IEEE 754 notation back to decimal: 1.99999988 * 2^30 = 2147483520
So technically, the C program must have printed out 2147483520, right?
The value to be represented would be 2147483647. the next two values which can be represented this way are 2147483520 and 2147483648.
As the latter is closer to the unrepresentable "ideal one", it gets used: in floating point, the values get rounded, not truncated.
The standard is available here. You might have to purchase it, as IEEE (and other organizations like it) mainly make their money by selling the standard, to defray their costs in assembling, lobbying for acceptance, and improving the quality of the standard.
The bits only mean what someone designates them to be
"When I use a word," Humpty Dumpty said in rather a scornful tone, "it
means just what I choose it to mean -- neither more nor less." "The
question is," said Alice, "whether you can make words mean so many
different things." "The question is," said Humpty Dumpty, "which is to
be master - - that's all." (Through the Looking Glass, Chapter 6)
In this case IEEE has decided what the bits mean, and the reason that the printf flag %f prints out the right corresponding human representation is due to the flag also following the same standard.
Occasionally you can manage to cast the bits into another data type (like an int) and print out the "other" representation of those bits. C will catch a lot of the normal number promotions, but you can confuse it, generally with the assistance of assigning pointer of the wrong type to the correct address (and dereferencing them).
Note that while you are doing the math by hand, the actual hardware isn't guaranteed to do the math exactly as you would. With integer math there is much more accuracy in the representation, but with floating point math, how you round a number makes a big difference in the output. That's not even mentioning the floating point errors which sometimes were burned into systems (thankfully not often).
Floating point formats are often in a "normalized form" where the most significant bit of the mantissa is always 1. Since it's always 1, you don't need to use up a bit to store it. So when decoding such a number representation, you'll need to add back the 1 at the top.
2147483647 = 2^31 - 1 = +1 * 2^30 * 1.1111 1111 1111 1111 1111 1111 1111 11
When encoding this number in the IEEE 754-1985 single precision format, the significand is rounded properly. For the rounding mode round to nearest even (the default rounding mode) this means it gets rounded up.
Before rounding:
exponent = 30, significand = 1.1111 1111 1111 1111 1111 1111 1111 11
After rounding the significand to 23 digits after the decimal point:
exponent = 30, significand = 10.0000 0000 0000 0000 0000 000
After normalizing:
exponent = 31, significand = 1.0
Encoded in the single precision format:
1 | 10011110 | 00000000000000000000000

Resources