Converting IEEE 754 Float to MIL-STD-1750A Float - c

I am trying to convert an IEEE 754 32-bit single-precision floating point value (a standard C float variable) to an unsigned long variable in the MIL-STD-1750A format. I have linked the specifications for both IEEE 754 and MIL-STD-1750A at the bottom of the post. Right now, I am having issues in my code with converting the exponent. I also see issues with converting the mantissa, but I haven't gotten to fixing those yet. I am using the examples listed in Table 3 of the MIL-STD-1750A specification to confirm whether my program is converting properly. Some of those examples do not make sense to me.
How can these two examples have the same exponent?
.5 x 2^0 (0100 0000 0000 0000 0000 0000 0000 0000)
-1 x 2^0 (1000 0000 0000 0000 0000 0000 0000 0000)
.5 x 2^0 has one decimal place, and -1 has no decimal places, so the value for .5 x 2^0 should be
.5 x 2^0 (0100 0000 0000 0000 0000 0000 0000 0010)
right? (0010 instead of 0001, because 1750A uses plus 1 bias)
How can the last example use all 32 bits and the first bit be 1, indicating a negative value?
0.7500001x2^4 (1001 1111 1111 1111 1111 1111 0000 0100)
I can see that a value with a 127 exponent should be 7F (0111 1111) but what about a value with a negative 127 exponent? Would it be 81 (1000 0001)? If so, is it because that is the two's complement +1 of 127?
Thank you

1) How can these two examples have the same exponent?
As I understand it, the sign and mantissa effectively define a 2's-complement value in the range [-1.0,1.0).
Of course, this leads to redundant representations (0.125 x 2^1 = 0.25 x 2^0, etc.), so a canonical normalized representation is chosen by disallowing mantissa values in the range [-0.5, 0.5).
So in your two examples, both -1.0 and 0.5 fall into the "allowed" mantissa range, so they both share the same exponent value.
2) How can the last example use all 32 bits and the first bit be 1, indicating a negative value?
That doesn't look right to me; how did you obtain that representation?
3) What about a value with a negative 127 exponent? Would it be 81 (1000 0001)?
I believe so.

Remember the fraction is a "signed fraction": the values are stored in 2's complement format. To read the magnitude of the mantissa 1000 0000 0000 0000 0000 0000, invert the bits to get 0.11111111111111111111111 (base 2) and add one in the last place, which gives exactly 1.0.
So that pattern is the most negative fraction, -1.0, and the whole word represents -1.0 x 2^0.
On the last example, there is a negative sign in the original document (-0.7500001x2^4)
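To tie the two answers together, here is a minimal C sketch of the conversion the question asks about. The helper name ieee_to_1750a is mine, normalization is delegated to frexpf, the mantissa is rounded to nearest (truncation is another valid choice), and NaN, infinity and exponent-range checking are left out:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: convert an IEEE 754 float to a MIL-STD-1750A 32-bit pattern
 * (24-bit two's complement mantissa in bits 31..8, 8-bit two's
 * complement exponent in bits 7..0).  No NaN/Inf or range handling.   */
static uint32_t ieee_to_1750a(float x)
{
    if (x == 0.0f)
        return 0;                          /* 1750A zero is all bits clear */

    int exp;
    float frac = frexpf(x, &exp);          /* x = frac * 2^exp, |frac| in [0.5, 1.0) */

    /* 1750A disallows mantissas in [-0.5, 0.5), so an exact -0.5
     * must be rewritten as -1.0 with the exponent reduced by one.      */
    if (frac == -0.5f) {
        frac = -1.0f;
        exp -= 1;
    }

    /* Scale to a 24-bit two's complement integer, rounding to nearest. */
    long mant = lroundf(frac * 8388608.0f);    /* 2^23 */
    if (mant == 8388608L) {                    /* rounding reached +1.0  */
        mant >>= 1;
        exp += 1;
    }

    return (((uint32_t)mant & 0xFFFFFFu) << 8) | ((uint32_t)exp & 0xFFu);
}

int main(void)
{
    /* Examples from Table 3: 0.5 x 2^0 and -1.0 x 2^0. */
    printf("%08lX\n", (unsigned long)ieee_to_1750a(0.5f));   /* 40000000 */
    printf("%08lX\n", (unsigned long)ieee_to_1750a(-1.0f));  /* 80000000 */
    return 0;
}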

Related

Why does 128 in one's and two's complement using 8 bits overflow?

Suppose I want to represent 128 in one's and two's complement using 8 bits, no sign bit
Wouldn't that be:
One's complement: 0111 1111
Two's complement: 0111 1110
No overflow
But the correct answer is:
One's complement: 0111 1111
Two's complement: 0111 1111
Overflow
Additional Question:
How come 1 in one's and two's complement is 0000 0001 and 0000 0001, respectively? How come you don't flip the bits like we did with 128?
One's and Two's Complements are both ways to represent signed integers.
For One's Complement Representation:
Positive Numbers: Represented by the regular binary representation
For example: the decimal value 1 will be represented in 8-bit One's Complement as 0000 0001
Negative Numbers: Represented by complementing the binary representation of the magnitude
For example: the decimal value -127 will be represented in 8-bit One's Complement as 1000 0000, because the binary representation of 127 is 0111 1111, which complemented is 1000 0000
For Two's Complement Representation:
Positive Numbers: Represented by the regular binary representation
For example: the decimal value 1 will be represented in 8-bit Two's Complement as 0000 0001
Negative Numbers: Represented by complementing the binary representation of the magnitude and then adding 1
For example: the decimal value -127 will be represented in 8-bit Two's Complement as 1000 0001, because the binary representation of 127 is 0111 1111, which complemented is 1000 0000; then add 0000 0001 to get 1000 0001
Therefore, 128 overflows in both instances because the binary representation of 128 is 1000 0000 which in ones complement represents -127 and in twos complement represents -128. In order to be able to represent 128 in both ones and twos complement you would need 9 bits and it would be represented as 0 1000 0000.
In 8-bit unsigned, 128 is 1000 0000. In 8-bit two's complement, that binary sequence is interpreted as -128. There is no representation for 128 in 8-bit two's complement.
0111 1110 is 126.
As mentioned in a comment, 0111 1111 is 127.
See https://www.cs.cornell.edu/~tomf/notes/cps104/twoscomp.html.
Both two's complement and one's complement are ways to represent negative numbers. Positive numbers are simply binary numbers; there is no complementing involved.
I worked on one computer with one's complement arithmetic (LINC). I vastly prefer two's complement, as there is only one representation for zero. The disadvantage to two's complement is that there is one value (-128, for 8-bit numbers) that can't be negated -- causing the overflow you're asking about. One's complement doesn't have that issue.
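A small C sketch (assuming a typical two's complement machine; the out-of-range conversions below are implementation-defined) makes the same point concrete:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 1000 0000 read as 8-bit two's complement is -128, so +128 has no
     * representation in a signed 8-bit type.  The conversion below is
     * implementation-defined; on typical machines it wraps to -128.    */
    int8_t a = (int8_t)0x80;
    printf("0x80 as int8_t    : %d\n", a);      /* usually -128 */

    /* Trying to negate it and squeeze the result back into 8 bits
     * lands on -128 again (the overflow the question asks about).      */
    int8_t b = (int8_t)-(int)a;
    printf("-(-128) in 8 bits : %d\n", b);      /* usually -128 */

    /* A 9th bit (here, a wider type) fixes it: 0 1000 0000 = 128.      */
    int16_t c = 128;
    printf("128 with more bits: %d\n", c);
    return 0;
}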

Struggling with union output

I've been trying to figure out this code for about an hour and still have no luck.
#include <stdio.h>
#include <stdlib.h>

int f(float f)
{
    union un { float f; int i; } u = {f};
    return (u.i & 0x7F800000) >> 23;
}

int main()
{
    printf("%d\n", f(1));
    return 0;
}
I don't understand how this works. I've tried f(1), f(2), f(3), f(4) and of course got different results. I've also read a lot about unions and related topics. What I have noticed is that when I delete 0x7F800000 from the return statement, the results stay the same. I want to know how u.i is generated; obviously it is not random garbage, but it also is not the one (1) passed as the function argument. What is going on here, and how does it work?
This really amounts to an understanding of how floating point numbers are represented in memory. (see IEEE 754).
In short, a 32-bit floating point number will have the following structure
bit 31 will be the sign bit for the overall number
bits 30 - 23 will be the exponent for the number, biased by 127
bits 22 - 0 will represent the fractional part of the number. This is normalized such that the digit before the decimal (actually binary) point is one.
With regard to the union, recall that a union is a block of computer memory that can hold one of its member types at a time, so the declaration:
union un
{
    float f;
    int i;
};
is creating a 32-bit block of memory that can hold either a floating point number or an integer at any given time. Now when we call the function with a floating point parameter, the bit pattern of that number is written into the union object u. When we then access the union through the i member, that same bit pattern is treated as an integer.
Thus, the general layout of a 32-bit floating point number is seee eeee efff ffff ffff ffff ffff ffff, where s represents the sign bit, e the exponent bits and f the fraction bits. OK, that is kind of gibberish, so hopefully an example will help.
To convert 7 into IEEE floating point, first convert 7 into binary (I've split the 32-bit number into 4-bit nibbles):
7 = 0000 0000 0000 0000 0000 0000 0000 0111
Now we need to normalize this, i.e. express this as a number raised to the power of two;
1.11 x 2^2
Here we need to remember that each power of two moves the binary point one place to the right (analogous to dealing with powers of 10).
From this, we now can generate the bit pattern
the overall sign of the number is positive, so the overall sign bit is 0.
the exponent is 2, but we bias the exponent by 127. This means that an exponent of -127 would be stored as 0, while an exponent of +128 would be stored as 255. Thus our exponent field would be 2 + 127 = 129, or 1000 0001.
Finally, our normalized fraction would be 1100 0000 0000 0000 0000 000. Notice we have dropped the leading 1 because it is always assumed to be there.
Putting this all together, we have as the bit pattern:
7 = 0100 0000 1110 0000 0000 0000 0000 0000
Now, the last little bit here is the bitwise AND with 0x7F800000, which written out in binary is 0111 1111 1000 0000 0000 0000 0000 0000. If we compare this to the general layout of an IEEE floating point number, we see that the mask selects the exponent bits, which are then shifted right by 23 bits.
So your program is just printing out the biased exponent of a floating point number. As an example,
#include <stdio.h>
#include <stdlib.h>

int f(float f)
{
    union un { float f; int i; } u = {f};
    return (u.i & 0x7F800000) >> 23;
}

int main()
{
    printf("%d\n", f(7));
    return 0;
}
gives an output of 129 as we would expect.
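Building on that, here is a small sketch of my own that pulls out all three fields and also prints the unbiased exponent; it uses memcpy rather than a union, which is another common way to reinterpret the bits (the helper name dump_float is made up for this illustration):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper: print the sign, biased/unbiased exponent and
 * fraction field of an IEEE 754 single-precision value.              */
static void dump_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* reinterpret the 32 bits  */

    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFF;
    unsigned fraction = bits & 0x7FFFFF;

    printf("%g: sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           f, sign, exponent, (int)exponent - 127, fraction);
}

int main(void)
{
    dump_float(1.0f);   /* exponent field 127, unbiased 0 */
    dump_float(7.0f);   /* exponent field 129, unbiased 2 */
    return 0;
}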

How does the compiler treat printing an unsigned int as a signed int?

I'm trying to figure out the output of the following code:
{
    unsigned int a = 10;
    a = ~a;
    printf("%d\n", a);
}
a will be 00001010 to begin with, and after the NOT operation it will turn into 11110101.
What exactly happens when one prints a as a signed integer that makes the printed result -11?
I thought I would end up seeing -5 maybe (going by the binary representation), but not -11.
I'd be glad to get a clarification on the matter.
2's complement notation is used to store negative numbers.
The number 10 is 0000 0000 0000 0000 0000 0000 0000 1010 in 4-byte (32-bit) binary.
a = ~a makes the content of a 1111 1111 1111 1111 1111 1111 1111 0101.
When this pattern is treated as a signed int on a two's complement machine, the 1 in the most significant bit marks the number as negative.
To read off its magnitude, take the two's complement of the pattern (invert the bits and add one):
1111 1111 1111 1111 1111 1111 1111 0101 becomes
0000 0000 0000 0000 0000 0000 0000 1011,
which is 11, so the value printed is -11.
When you write a = ~a; you invert each and every bit in a, which is also called the one's complement.
The representation of a negative number is implementation-dependent, meaning that different architectures could have different representations for -10 or -11.
Assuming a 32-bit architecture on a common processor that uses two's complement to represent negative numbers, -1 will be represented as FFFFFFFF (hexadecimal), i.e. all 32 bits set to 1.
~a will be represented as FFFFFFF5, or in binary 1...10101, which is the representation of -11.
Note: the first part is always the same and is not implementation-dependent; ~a is FFFFFFF5 on any 32-bit architecture. It is only the second part (-11 == FFFFFFF5) that is implementation-dependent. By the way, it would be -10 on an architecture that used one's complement to represent negative numbers.
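A short sketch that shows both views of the same bit pattern (the conversion to int below is implementation-defined, but yields -11 on the usual two's complement machines):

#include <stdio.h>

int main(void)
{
    unsigned int a = 10;
    a = ~a;                          /* 0xFFFFFFF5 with a 32-bit unsigned int */

    printf("%u\n", a);               /* 4294967285: the value a really holds  */
    printf("%X\n", a);               /* FFFFFFF5                              */
    /* Reinterpreting that pattern as signed is what %d effectively does; the
     * explicit conversion below is implementation-defined, but on two's
     * complement machines it yields -11, matching the explanation above.     */
    printf("%d\n", (int)a);
    return 0;
}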

Fraction to right of radix - Floating point conversion

When converting a number from base 10 to binary using the floating point bit model, what determines how many zeros you "zero pad" the fraction to the right of the radix?
Take for example -44.375
It was a question on a test in my systems programming course, and below is the answer the prof provided the class with. I posted this because most comments below seem to dispute what my professor states in the answer, which is causing some confusion.
Answer: 1 1000 0100 0110 0011 0000 0000 0000 000
-- sign bit: 1
-- fixed point: 44.375 = 2^5 + 2^3 + 2^2 + 2^-2 + 2^-3
= 101100.011
= 1.01100011 * 2^5
-- exponent: 5 + 127 = 132 = 1000 0100
-- fraction: 0110 0011 0000 0000 0000 000
Marking:
-- 1 mark for correct sign bit
-- 2 marks for correct fixed point representation
-- 2 marks for correct exponent (in binary)
-- 2 marks for correct fraction (padded with zeros)
Unless the float is very small, there is no left "zero pad" of the fraction.
The sample here is -0x1.63 (a hexadecimal significand) * 2^5 (a decimal exponent).
The exponent is adjusted until the leading digit is 1.
printf("%a\n", -44.375);
// -0x1.63p+5
[Edit]
Your prof wants the "correct fraction (padded with zeros)" written out to the full number of fraction bits in a float, so the significand in your example is
1.0110 0011 0000 0000 0000 000
The leading 1 is not stored explicitly in a typical float.
OP "what determines how many zeros you "zero pad" the fraction to the right of the radix?
A: IEEE 754 binary32 (a popular float implementation) has a 24 bit significand. A lead bit (usually 1) and a 23-bit fraction. Thus your "right" zero padding goes out to fill 23 places.
To determine the significand of an IEEE-754 32-bit binary floating-point value:
Figure out where the leading (most significant) 1 bit is. That is the starting point. Calculate 23 more bits. If there is anything left over, round it into the last of the 24 bits (carrying as necessary).
Exception: If the leading bit is less than 2^-126, use the 2^-126 bit as the starting point, even though it is zero.
That gives the mathematical significand. To get the bits for the significand field, remove the first bit. (And, if the exception was used, set the encoded exponent to zero instead of the normal value.)
Another exception: If the leading bit, after rounding, is 2^128 or greater, the conversion overflows. Set the result to infinity.
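To check the professor's bit pattern on a real machine, here is a small sketch assuming float is the IEEE 754 binary32 format described above:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float x = -44.375f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* view the raw encoding */

    /* Expect 0xC2318000:
     *   sign     1
     *   exponent 1000 0100               (132 = 5 + 127)
     *   fraction 0110 0011 0000 ... 000  (the zero-padded part)        */
    printf("%08lX\n", (unsigned long)bits);
    printf("%a\n", x);                         /* -0x1.63p+5 */
    return 0;
}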

Representation of negative numbers in C?

How does C represent negative integers?
Is it by two's complement representation or by using the MSB (most significant bit)?
-1 in hexadecimal is ffffffff.
So please clarify this for me.
ISO C (C99 section 6.2.6.2/2 in this case but it carries forward to later iterations of the standard(a)) states that an implementation must choose one of three different representations for integral data types, two's complement, ones' complement or sign/magnitude (although it's incredibly likely that the two's complement implementations far outweigh the others).
In all those representations, positive numbers are identical, the only difference being the negative numbers.
To get the negative representation for a positive number, you:
invert all bits then add one for two's complement.
invert all bits for ones' complement.
invert just the sign bit for sign/magnitude.
You can see this in the table below:
number | two's complement | ones' complement | sign/magnitude
=======|=====================|=====================|====================
5 | 0000 0000 0000 0101 | 0000 0000 0000 0101 | 0000 0000 0000 0101
-5 | 1111 1111 1111 1011 | 1111 1111 1111 1010 | 1000 0000 0000 0101
Keep in mind that ISO doesn't mandate that all bits are used in the representation. They introduce the concept of a sign bit, value bits and padding bits. Now I've never actually seen an implementation with padding bits but, from the C99 rationale document, they have this explanation:
Suppose a machine uses a pair of 16-bit shorts (each with its own sign bit) to make up a 32-bit int and the sign bit of the lower short is ignored when used in this 32-bit int. Then, as a 32-bit signed int, there is a padding bit (in the middle of the 32 bits) that is ignored in determining the value of the 32-bit signed int. But, if this 32-bit item is treated as a 32-bit unsigned int, then that padding bit is visible to the user’s program. The C committee was told that there is a machine that works this way, and that is one reason that padding bits were added to C99.
I believe the machine they may have been referring to was the Datacraft 6024 (and its successors from Harris Corp). In those machines, you had a 24-bit word used for the signed integer but, if you wanted the wider type, it strung two of them together as a 47-bit value with the sign bit of one of the words ignored:
+---------+-----------+--------+-----------+
| sign(1) | value(23) | pad(1) | value(23) |
+---------+-----------+--------+-----------+
\____________________/ \___________________/
upper word lower word
(a) Interestingly, given the scarcity of modern implementations that actually use the other two methods, there's been a push to have two's complement accepted as the one true method. This has gone quite a long way in the C++ standard (WG21 is the workgroup responsible for this) and is now apparently being considered for C as well (by WG14).
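If you want to see which column of the table your own implementation matches, a small sketch like this (inspecting the object representation of -5 via memcpy) will show it; on the overwhelmingly common two's complement implementations it prints FFFB:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    int16_t  n = -5;
    uint16_t bits;
    memcpy(&bits, &n, sizeof bits);   /* look at the stored representation */

    /* Two's complement implementations print FFFB, i.e.
     * 1111 1111 1111 1011, matching the table above.                     */
    printf("%04X\n", (unsigned)bits);
    return 0;
}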
C allows sign/magnitude, one's complement and two's complement representations of signed integers. Most typical hardware uses two's complement for integers and sign/magnitude for floating point (and yet another possibility -- a "bias" representation for the floating point exponent).
-1 in hexadecimal is ffffffff. So please clarify me in this regard.
In two's complement (by far the most commonly used representation), each bit except the most significant bit (MSB), taken from right to left (in increasing order of magnitude), has a value 2^n, where n increases from zero by one. The MSB has the value -2^n.
So, for example, in an 8-bit two's complement integer, the MSB has the place value -2^7 (-128), so the binary number 1111 1111 (base 2) is equal to -128 + 0111 1111 (base 2) = -128 + 127 = -1
One useful feature of two's complement is that a processor's ALU only requires an adder block to perform subtraction, by forming the two's complement of the right-hand operand. For example, 10 - 6 is equivalent to 10 + (-6); in 8-bit binary (for simplicity of explanation) this looks like:
0000 1010
+1111 1010
---------
[1]0000 0100 = 4 (decimal)
Where the [1] is the discarded carry bit. Another example: 10 - 11 == 10 + (-11):
0000 1010
+1111 0101
---------
1111 1111 = -1 (decimal)
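The same two examples can be reproduced in C with 8-bit unsigned arithmetic, which discards the carry out of bit 7 automatically (this is just an illustrative sketch; the final cast back to int8_t is implementation-defined but gives -1 on typical machines):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 10 - 6 done as 10 + (-6): negate by inverting and adding one,
     * then add; the carry out of bit 7 is dropped by the uint8_t wrap. */
    uint8_t a = 10, b = 6, c = 11;

    uint8_t neg_b = (uint8_t)(~b + 1);                  /* 1111 1010 */
    uint8_t r1 = (uint8_t)(a + neg_b);
    printf("10 - 6  = %d (0x%02X)\n", r1, (unsigned)r1);            /*  4, 0x04 */

    uint8_t neg_c = (uint8_t)(~c + 1);                  /* 1111 0101 */
    uint8_t r2 = (uint8_t)(a + neg_c);
    printf("10 - 11 = %d (0x%02X)\n", (int8_t)r2, (unsigned)r2);    /* -1, 0xFF */
    return 0;
}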
Another feature of two's complement is that it has a single value representing zero, whereas sign-magnitude and one's complement each have two; +0 and -0.
For integral types it's usually two's complement (implementation specific). For floating point, there's a sign bit.
