Addition of single precision negative floats in C [duplicate]

How do I subtract IEEE 754 numbers?
For example: 0,546875 - 32.875...
-> 0,546875 is 0 01111110 10001100000000000000000 in IEEE-754
-> -32.875 is 1 10000111 01000101111000000000000 in IEEE-754
So how do I do the subtraction? I know I have to make both exponents equal, but what do I do after that? Take the 2's complement of the -32.875 mantissa and add it to the 0.546875 mantissa?

Really not any different than you do it with pencil and paper. Okay, a little different:
123400 - 5432 = 1.234*10^5 - 5.432*10^3
the bigger number dominates, shift the smaller number's mantissa off into the bit bucket until the exponents match
1.234*10^5 - 0.05432*10^5
then perform the subtraction with the mantissas
1.234 - 0.05432 = 1.17968
1.17968 * 10^5
Then normalize (which in this case it already is).
That was with base 10 numbers.
In IEEE float, single precision
123400 = 0x1E208 = 0b11110001000001000
11110001000001000.000...
to normalize that we have to shift the binary point 16 places to the left, so
1.1110001000001000 * 2^16
The exponent is biased, so we add 127 to 16 and get 143 = 0x8F. It is a positive number, so the sign bit is 0. Now we can start to build the IEEE floating point number: the leading 1 before the binary point is implied and not stored in single precision, so we get rid of it and keep the fraction.
sign bit, exponent, mantissa
0 10001111 1110001000001000...
0100011111110001000001000...
0100 0111 1111 0001 0000 0100 0...
0x47F10400
And if you write a program to see what a computer thinks 123400 is, you get the same thing:
0x47F10400 123400.000000
So we know the exponent and mantissa for the first operand.
Now the second operand
5432 = 0x1538 = 0b0001010100111000
Normalize, shift decimal 12 bits left
1010100111000.000
1.010100111000000 * 2^12
The exponent is biased add 127 and get 139 = 0x8B = 0b10001011
Put it all together
0 10001011 010100111000000
010001011010100111000000
0100 0101 1010 1001 1100 0000...
0x45A9C000
And a computer program/compiler gives the same
0x45A9C000 5432.000000
Now to answer your question. Using the component parts of the floating point numbers (I have restored the implied 1 here because we need it):
0 10001111 111100010000010000000000 - 0 10001011 101010011100000000000000
We have to line up our binary points, just like in grade school, before we can subtract. In this context that means shifting the mantissa of the number with the smaller exponent to the right, tossing bits off the end, until the exponents match:
0 10001111 111100010000010000000000 - 0 10001011 101010011100000000000000
0 10001111 111100010000010000000000 - 0 10001100 010101001110000000000000
0 10001111 111100010000010000000000 - 0 10001101 001010100111000000000000
0 10001111 111100010000010000000000 - 0 10001110 000101010011100000000000
0 10001111 111100010000010000000000 - 0 10001111 000010101001110000000000
Now we can subtract the mantissas. If the sign bits match then we actually subtract; if they don't match then we add. Here they match, so this will be a subtraction.
Computers perform a subtraction using addition logic, inverting the second operand on the way into the adder and asserting the carry-in bit, like this:
                         1
  111100010000010000000000
+ 111101010110001111111111
==========================
And now, just like with paper and pencil, let's perform the add; the top row shows the carry bits:
 1111000100000111111111111
  111100010000010000000000
+ 111101010110001111111111
==========================
  111001100110100000000000
or do it with hex on your calculator
111100010000010000000000 = 1111 0001 0000 0100 0000 0000 = 0xF10400
111101010110001111111111 = 1111 0101 0110 0011 1111 1111 = 0xF563FF
0xF10400 + 0xF563FF + 1 = 0x1E66800
1111001100110100000000000 = 1 1110 0110 0110 1000 0000 0000 = 0x1E66800
A little bit about how the hardware works: since this was really a subtract using the adder, we also invert the carry-out bit (or on some computers they leave it as is). A carry out of 1 here is a good thing; it means no borrow was needed, so we basically discard it. Had the carry out been 0 we would have needed more work. There is no borrow, so our answer is really 0xE66800.
Very quickly, let's see that another way: instead of inverting and adding one, let's just use a calculator.
111100010000010000000000 - 000010101001110000000000 =
0xF10400 - 0x0A9C00 =
0xE66800
By trying to visualize it I perhaps made it worse. The result of the mantissa subtraction is 111001100110100000000000 (0xE66800). There was no movement in the most significant bit; we end up with a 24 bit number whose msbit is a 1, so no normalization is needed. To normalize you would shift the mantissa left or right until the most significant 1 lands in that leftmost of the 24 bit positions, adjusting the exponent for each bit shifted.
Now, stripping the implied 1. bit off the answer, we put the parts together:
0 10001111 11001100110100000000000
01000111111001100110100000000000
0100 0111 1110 0110 0110 1000 0000 0000
0x47E66800
If you have been following along by writing a program to do this, I did as well. This program type-puns through a union, which is a gray area (C99 and later tolerate it; C++ does not). I got away with it with my compiler on my computer; don't expect it to work all the time.
#include <stdio.h>

union
{
    float f;
    unsigned int u;
} myun;

int main ( void )
{
    float a,b,c;
    a=123400;
    b= 5432;
    c=a-b;
    myun.f=a; printf("0x%08X %f\n",myun.u,myun.f);
    myun.f=b; printf("0x%08X %f\n",myun.u,myun.f);
    myun.f=c; printf("0x%08X %f\n",myun.u,myun.f);
    return(0);
}
And our result matches the output of the above program; we got 0x47E66800 doing it by hand:
0x47F10400 123400.000000
0x45A9C000 5432.000000
0x47E66800 117968.000000
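If the union bothers you, the same bit-pattern dump can be done with memcpy, which sidesteps the type punning question entirely. A minimal variant (the helper name float_bits is mine):

#include <stdio.h>
#include <string.h>

/* Copy a float's raw bit pattern into an unsigned int via memcpy
   instead of a union; assumes both types are 32 bits. */
static unsigned int float_bits(float f)
{
    unsigned int u;
    memcpy(&u, &f, sizeof u);
    return u;
}

int main(void)
{
    float c = 123400.0f - 5432.0f;
    printf("0x%08X %f\n", float_bits(c), c); /* 0x47E66800 117968.000000 */
    return 0;
}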
If you are writing a program to synthesize the floating point math, your program can perform the subtract directly; you don't have to do the invert-and-add-one thing, which over-complicates it as we saw above. If you get a negative result, though, you need to fix up the sign bit, invert your result, then normalize.
So:
1) Extract the parts: sign, exponent, mantissa.
2) Align your binary points by sacrificing mantissa bits from the number with the smaller exponent: shift that mantissa to the right until the exponents match.
3) Being a subtract operation, if the sign bits are the same you perform a subtract of the mantissas; if the sign bits are different you perform an add.
4) If the result is zero then your answer is zero; encode the IEEE value for zero as the result. Otherwise:
5) Normalize the number: shift the answer right or left (the result can be 25 bits from a 24 bit add, and a subtract can need a dramatic shift to normalize, either one bit to the right or many bits to the left) until you have a 24 bit number with the most significant 1 left justified. 24 bits is for single precision float. The more precise way to define normalizing is to shift the mantissa left or right until the number resembles 1.something: if you had 0.001 you would shift left 3; if you had 11.10 you would shift right 1. A shift of the mantissa left decreases your exponent, a shift right increases it, just as when we converted from integer to float above.
6) For single precision, remove the leading 1. from the mantissa. If the exponent has overflowed, the result becomes an infinity (exponent all ones, mantissa zero) rather than a normal number. If the sign bits were different and you performed an add, then you have to deal with figuring out the result sign bit. If, as above, everything is fine, you just place the sign bit, exponent, and mantissa in the result.
Multiply and divide are different; you asked about subtract, so that is all I covered.
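To make the recipe concrete, here is a minimal C sketch of steps 1 through 6 (the function and helper names are mine, not from any library). It is deliberately restricted to the easy case walked through above: both operands positive and normal, the first at least as large as the second, and the shifted-off mantissa bits truncated rather than rounded:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

static uint32_t f32_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_f32(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Sketch only: requires 0 < fb <= fa, both normal; no rounding,
   no subnormals, no infinities, no NaNs. */
static float f32_sub_simple(float fa, float fb)
{
    uint32_t a = f32_bits(fa), b = f32_bits(fb);

    /* 1) extract the parts and restore the implied 1 */
    int ea = (int)((a >> 23) & 0xFF), eb = (int)((b >> 23) & 0xFF);
    uint32_t ma = (a & 0x7FFFFFu) | 0x800000u;
    uint32_t mb = (b & 0x7FFFFFu) | 0x800000u;

    /* 2) align: shift the smaller-exponent mantissa right, tossing bits */
    mb >>= (ea - eb);

    /* 3) sign bits match, so subtract the mantissas */
    uint32_t m = ma - mb;

    /* 4) exact zero result */
    if (m == 0) return 0.0f;

    /* 5) normalize: shift left until the leading 1 is in bit 23,
          decreasing the exponent for each shift */
    while ((m & 0x800000u) == 0) { m <<= 1; ea--; }

    /* 6) strip the implied 1 and reassemble; the sign bit is 0 here */
    return bits_f32(((uint32_t)ea << 23) | (m & 0x7FFFFFu));
}

int main(void)
{
    float c = f32_sub_simple(123400.0f, 5432.0f);
    printf("0x%08X %f\n", (unsigned)f32_bits(c), c); /* expect 0x47E66800 117968.000000 */
    return 0;
}

Running it reproduces the hand-computed 0x47E66800; a real implementation would also handle signs, rounding, zeros, subnormals, infinities, and NaNs.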

I'm presuming 0,546875 means 0.546875.
Firstly, to correct/clarify:
0 01111110 10001100000000000000000 = 0011 1111 0100 0110 0000 0000 0000 0000 =
0x3F460000 in IEEE-754 is 0.77343750, not 0.546875.
0.546875 in IEEE-754 is 0x3F0C0000 = 0011 1111 0000 1100 0000 0000 0000 0000 =
0 01111110 00011000000000000000000 = 1 x 1.00011 x 2^(01111110 - 127) =
1.00011 x 2^(126 - 127) = 1.00011 x 2^-1 = (1 + 1/16 + 1/32) x 1/2.
1 10000111 01000101111000000000000 = 1100 0011 1010 0010 1111 0000 0000 0000 =
0xC3A2F000 in IEEE-754 is -325.87500, not -32.875.
-32.875 in IEEE-754 is 0xC2038000 = 1100 0010 0000 0011 1000 0000 0000 0000 =
1 10000100 00000111000000000000000 = -1 x 1.00000111 x 2^(10000100 - 127) =
-1.00000111 x 2^(132 - 127) = -1.00000111 x 2^5 = (1 + 1/64 + 1/128 + 1/256) x -32.
32.875 in IEEE-754 is 0x42038000 = 0100 0010 0000 0011 1000 0000 0000 0000 =
0 10000100 00000111000000000000000 = 1 x 1.00000111 x 2^(10000100 - 127) =
1.00000111 x 2^(132 - 127) = 1.00000111 x 2^5 = (1 + 1/64 + 1/128 + 1/256) x 32.
The subtraction is carried out as follows:
1.00011000 x 1/2
- 1.00000111 x 32
------------------
==>
0.00000100011 x 32
- 1.00000111000 x 32
---------------
==>
-1 x (
1.00000111000 x 32
- 0.00000100011 x 32
---------------
)
==>
-1 x (
1.00000110112 x 32 // borrow (the trailing digit 2 stands for binary 10)
- 0.00000100011 x 32
---------------
)
==>
-1 x (
1.00000110112 x 32
- 0.00000100011 x 32
---------------
1.00000010101 x 32
)
==>
-1.00000010101 x 32 =
-1.00000010101000000000000 x 32 =
-1.00000010101000000000000 x 2^5 =
-1.00000010101000000000000 x 2^(132 - 127) =
-1.00000010101000000000000 x 2^(10000100 - 127)
==>
1 10000100 00000010101000000000000 =
1100 0010 0000 0001 0101 0000 0000 0000 =
0xC2015000
Note that in this example we did not need to handle underflow, which is more complicated.
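As a sanity check, a few lines of C reproduce the result; the memcpy bit dump is my addition, not part of the arithmetic above:

#include <stdio.h>
#include <string.h>

int main(void)
{
    float d = 0.546875f - 32.875f;
    unsigned int u;
    memcpy(&u, &d, sizeof u); /* raw IEEE-754 bit pattern */
    printf("0x%08x %f\n", u, d); /* expect 0xc2015000 -32.328125 */
    return 0;
}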

Related

Can someone explain why this works to count set bits in an unsigned integer?

I saw this code called "Counting bits set, Brian Kernighan's way". I am puzzled as to how bitwise AND'ing an integer with its decrement works to count set bits. Can someone explain this?
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
    v &= v - 1; // clear the least significant bit set
}
Walkthrough
Let's walk through the loop with an example: let's set v = 42, which is 0010 1010 in binary.
First iteration: c=0, v=42 (0010 1010).
Now v-1 is 41, which is 0010 1001 in binary.
Let's compute v & v-1:
  0010 1010
& 0010 1001
  .........
  0010 1000
Now v&v-1's value is 0010 1000 in binary, or 40 in decimal. This value is stored into v.
Second iteration: c=1, v=40 (0010 1000). Now v-1 is 39, which is 0010 0111 in binary. Let's compute v & v-1:
  0010 1000
& 0010 0111
  .........
  0010 0000
Now v&v-1's value is 0010 0000, which is 32 in decimal. This value is stored into v.
Third iteration: c=2, v=32 (0010 0000). Now v-1 is 31, which is 0001 1111 in binary. Let's compute v & v-1:
  0010 0000
& 0001 1111
  .........
  0000 0000
Now v&v-1's value is 0.
Fourth iteration: c=3, v=0. The loop terminates. c contains 3, which is the number of bits set in 42.
Why it works
You can see that going from v to v-1 flips the least significant set bit (i.e. the rightmost bit that is a 1) from 1 to 0 and all the bits to the right of it from 0 to 1.
When you do a bitwise AND between v and v-1, the bits to the left of that bit are the same in v and v-1, so the bitwise AND leaves them unchanged. The least significant set bit and all bits to the right of it differ between v and v-1, so the resulting bits will be 0.
In our original example of v=42 (0010 1010) the least significant set bit is the second bit from the right. You can see that v-1 has the same bits as 42 except the last two: the 0 became a 1 and the 1 became a 0.
Similarly for v=40 (0010 1000) the least significant set bit is the fourth bit from the right. When computing v-1 (0010 0111) you can see that the left four bits remain unchanged while the right four bits are inverted (zeroes became ones and ones became zeroes).
The effect of v = v & v-1 is therefore to set the least significant bit of v to 0 and leave the rest unchanged. When all bits have been cleared this way, v is 0 and we have counted all bits.
Each time through the loop one bit is counted, and one bit is cleared (set to zero).
How this works is: when you subtract one from a number you change the least significant one bit to a zero, and all the less significant bits to one -- though those don't matter. They don't matter because they are zero in the value you're decrementing, so they will be zero after the AND operation anyway.
XXX1 => XXX0
XX10 => XX01
X100 => X011
etc.
Let A = a_(n-1) a_(n-2) ... a_1 a_0 be the number whose bits we want to count, and k the index of the rightmost 1 bit.
Hence A = a_(n-1) a_(n-2) ... a_(k+1) 1 0...0 = A_k + 2^k, where A_k = a_(n-1) a_(n-2) ... a_(k+1) 0 0...0.
As 2^k - 1 = 0...0 1 1...1 (k ones), we have
A - 1 = A_k + 2^k - 1 = a_(n-1) a_(n-2) ... a_(k+1) 0 1...1
Now perform the bitwise AND of A and A-1:
  a_(n-1) a_(n-2) ... a_(k+1) 1 0...0    A
& a_(n-1) a_(n-2) ... a_(k+1) 0 1...1    A-1
  -----------------------------------
  a_(n-1) a_(n-2) ... a_(k+1) 0 0...0    A & A-1 = A_k
So A & A-1 is identical to A, except that its rightmost 1 bit has been cleared, which proves the validity of the method.
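To tie the walkthrough and the proof together, here is the loop wrapped into a complete program (the function name count_bits is mine):

#include <stdio.h>

/* Kernighan's method: each v &= v - 1 clears the least significant
   set bit, so the loop body runs once per set bit. */
static unsigned int count_bits(unsigned int v)
{
    unsigned int c;
    for (c = 0; v; c++)
        v &= v - 1;
    return c;
}

int main(void)
{
    printf("%u\n", count_bits(42u)); /* prints 3, as in the walkthrough */
    return 0;
}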

difficulty understanding signed not

I'm having trouble understanding why c equals -61 in the following program:
#include <stdio.h>

int main(void) {
    unsigned int a = 60; // 60 = 0011 1100
    unsigned int b = 13; // 13 = 0000 1101
    int c = 0;
    c = ~a; // -61 = 1100 0011
    printf("Line 4 - Value of c is %d\n", c);
    return 0;
}
I do understand how the NOT operator works on 0011 1100 (the result being 1100 0011). But I'm not sure why the decimal number is increased by 1. Is this some sort of type conversion from unsigned int (from a) into signed int (from c)?
Negating a number in two's complement (the standard signed format) constitutes a bitwise inversion followed by adding one.
Note that for simplicity I am using a single signed byte.
So if 60 = 0011 1100
then -60 = 1100 0011 + 1
         = 1100 0100
And for a signed byte, the most significant bit is negative, so
-60 = -128 + 64 + 4
The ~ operator only does the inversion without the +1, so c = ~a is one less than -60, namely -61 = 1100 0011.
You need to add 1 to account for the fact that the most significant bit is -128, while the largest positive number is 0111 1111 = 127. All negative numbers have a 1 for -128 which needs to be offset.
This is easy to see when you look at converting 0 to -0. Invert 00000000 and you get 11111111, and adding one gets you back to 00000000. Do the same with 1 to -1 and you get 11111111, which is -1, the negative number closest to zero.
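The identity behind all of this is ~x == -x - 1 on two's complement machines; here is a tiny check (my illustration, not part of the answer above):

#include <stdio.h>

int main(void)
{
    unsigned int a = 60;
    int c = ~a; /* bit pattern 1111...1100 0011, reinterpreted as signed */
    /* On two's complement systems ~x == -x - 1, so both lines print -61. */
    printf("%d\n", c);
    printf("%d\n", -(int)a - 1);
    return 0;
}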

A question about C operations: return 1 when all bits of byte i of x equal 1; 0 otherwise

You are asked to complete the following C function:
/* Return 1 when all bits of byte i of x equal 1; 0 otherwise. */
int allBits_ofByte_i(unsigned x, int i) {
    return _____________________ ;
}
My solution: !!(x&(0xFF << (i<<3)))
The correct answer to this question is:
!~(~0xFF | (x >> (i << 3)))
Can someone explain it?
Also, can someone take a look at my answer? Is it right?
The expression !~(~0xFF | (x >> (i << 3))) is evaluated as follows.
i << 3 multiplies i by 8 to get a number of bits, which will be 0, 8, 16, or 24, depending on which byte the caller wants to test. This is actually the number of bits to ignore, as it is the number of bits that are less significant than the byte we're interested in.
(x >> ...) shifts the test value right to eliminate the low bits that we're not interested in. The 8 bits of interest are now the lowest 8 bits in the unsigned value we're evaluating. Note that other higher bits may or may not be set.
(~0xFF | ...) sets all 24 bits above the 8 we're interested in, but does not alter those 8 bits. (With a 32-bit int, ~0xFF is shorthand for 0xFFFFFF00, and yes, arguably 0xFFu should be used.)
~(...) flips all bits. This will result in a value of zero if every bit was set, and a non-zero value in every other case.
!(...) logically negates the result. This will result in a value of 1 only if every bit was set during step 3. In other words, every bit in the 8 bits we were interested in was set. (The other 24 bits were set in step 3.)
The algorithm can be summed up as, set the 24 bits we're not interested in, then verify that 32 bits are set.
Your answer took a slightly different approach, which was to shift the 0xFF mask left rather than shift the test value right. That was my first thought for how to approach the problem too! But your double negation only verifies that at least one bit in the byte is set, not that every bit is set, which is why your answer wouldn't produce correct results in all cases.
x is of unsigned integer type. Let's say that x is (often) 32 bit.
One byte consists of 8 bits, so x has 4 bytes in this case, indexed 0, 1, 2 or 3.
According to the solution, the byte numbering within the value can be imagined as follows (byte 0 is the least significant):
x => bbbb bbbb bbbb bbbb bbbb bbbb bbbb bbbb
i =>     3         2         1         0
I will try to break it down:
!~ ( ~0xFF | ( x >> (i << 3) ) )
i can be either 0, 1, 2 or 3, so i << 3 would give you 0, 8, 16 or 24. (i << n is like multiplying i by 2^n; it shifts i to the left n places, filling with zeros.)
Note that 0, 8, 16 and 24 are the starting positions of the byte segments: bits 0-7, 8-15, 16-23, 24-31.
This is used as the shift amount: x >> (i << 3) shifts x to the right by that result (0, 8, 16 or 24 places), so that the byte denoted by the i parameter now occupies the rightmost bits.
Until now you manipulated x so that the byte you are interested in is located on the right most 8 bits (the right most byte).
~0xFF is the inversion of 0000 0000 0000 0000 0000 0000 1111 1111 which gives you 1111 1111 1111 1111 1111 1111 0000 0000
The bitwise or operator is applied to the two results above, which would result in
1111 1111 1111 1111 1111 1111 abcd efgh - the letters being the bits of the corresponding byte of x.
~1111 1111 1111 1111 1111 1111 abcd efgh will turn into 0000 0000 0000 0000 0000 0000 ABCD EFGH - the capital letters being the inverse of the lower letters' values.
!0000 0000 0000 0000 0000 0000 ABCD EFGH is a logical operation. !n is 1 if n is 0, and it is 0 if n is otherwise.
So you get a 1 if all the inverted bits of the corresponding byte were 0000 0000 (i.e. the byte is 1111 1111).
Otherwise you get a 0.
In the C programming language a result of 0 corresponds to a boolean false value. And a result different than 0 corresponds to a boolean true value.
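A small test program (entirely my own illustration) makes the difference between the two expressions concrete; 0x12FF5678 has byte 2 all ones, while 0x120F5678 has byte 2 only partially set:

#include <stdio.h>

static int all_bits(unsigned x, int i) { return !~(~0xFFu | (x >> (i << 3))); } /* accepted answer */
static int any_bits(unsigned x, int i) { return !!(x & (0xFFu << (i << 3))); }  /* asker's attempt */

int main(void)
{
    printf("%d %d\n", all_bits(0x12FF5678u, 2), any_bits(0x12FF5678u, 2)); /* 1 1 */
    printf("%d %d\n", all_bits(0x120F5678u, 2), any_bits(0x120F5678u, 2)); /* 0 1 */
    return 0;
}

The asker's version reports 1 in the second case because the byte is merely non-zero, not all ones.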

Is there an easier way to extract certain bits from a hex value in string format?

I feel like I'm way over-thinking this and that there must be a simpler way.
Let's say I have a string in hexadecimal format, like:
"0x1fffff51"
(Binary: 0001 1111 1111 1111 1111 1111 0101 0001)
Now, depending on some information I have, I need to extract the first x binary bits, then the next y binary bits, then the remaining 32 - x - y binary bits.
Let's say x = 5, y = 7
So I would want:
x = 10001 = 17
y = 1111010 = 122
z = 0001 1111 1111 1111 1111 = 131071
The way I was thinking of approaching it was:
1. Convert the string into a number using strtoul()
2. Create a new mask dynamically using mask = 2^x - 1 (all 1's, x bits long)
3. Bitwise AND the mask and the number (store as X)
4. Create a new mask dynamically using mask = 2^y - 1
5. Shift the number x bits to the right
6. Bitwise AND the mask and the number (store as Y)
7. Shift the number y bits to the right, store as Z
I'm almost 100% positive bit shifting works on the parsed integer value, so I don't think that'll be a problem.
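The plan above translates directly into a few lines of C. A minimal sketch under those assumptions (the variable names are mine; it assumes x + y <= 32 and that unsigned long holds the value):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *s = "0x1fffff51";
    unsigned long v = strtoul(s, NULL, 16); /* string -> integer */
    unsigned x_bits = 5, y_bits = 7;

    unsigned long x = v & ((1UL << x_bits) - 1);             /* lowest x bits */
    unsigned long y = (v >> x_bits) & ((1UL << y_bits) - 1); /* next y bits   */
    unsigned long z = v >> (x_bits + y_bits);                /* the rest      */

    printf("x=%lu y=%lu z=%lu\n", x, y, z); /* x=17 y=122 z=131071 */
    return 0;
}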

C - data type conversion without computer

I have a sample question from a test at my school. What is the simplest way to solve it on paper?
The question:
Run-time system uses two's complement for representation of integers. Data type int has size 32 bits, data type short has size 16 bits. What does printf show? (The answer is ffffe43c)
short int x = -0x1bc4; /* !!! short */
printf ( "%x", x );
Let's do it in two steps: 0x1bc4 = 0x1bc3 + 1.
First of all, compute this at full 32-bit width:
0 - 1 = ffffffff
then
ffffffff - 1bc3
This can be done digit by digit:
  ffffffff
- 00001bc3
and you will get the result you have (ffffe43c).
Since your x is negative, take the two's complement of its magnitude, which will yield:
2's(-x) = ~(x) + 1
2's(-0x1BC4) = ~(0x1BC4) + 1 => 0xE43C
 0x1BC4 = 0001 1011 1100 0100
~0x1BC4 = 1110 0100 0011 1011
     +1 = [1]110 0100 0011 1100 (brackets around the MSB)
which is how your number is represented internally.
Now %x expects a 32-bit integer, so the short is promoted to int and sign-extended, which copies the MSB into the upper 16 bits of the value and yields:
1111 1111 1111 1111 1110 0100 0011 1100 == 0xFFFFE43C
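For completeness, the question's two lines wrap into a tiny program you can run to confirm (assuming 32-bit int and two's complement):

#include <stdio.h>

int main(void)
{
    short int x = -0x1bc4; /* !!! short */
    /* x is promoted to int (sign-extended) before printf sees it,
       so %x prints the 32-bit pattern ffffe43c. */
    printf("%x\n", x);
    return 0;
}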
