I am trying to solve a multiplication problem with fixed point numbers. The numbers are 32 bit. My architecture is 8 bit. So here goes:
I am using 8.8 notation i.e., 8 for integer, 8 for fraction.
I have 0x0A78, which is 10.468. I take its two's complement, and the answer is 0xFFFFF588, which I truncate to 16 bits as 0xF588 and store. The reason is that I only want to multiply two 2-byte numbers.
Now, when I multiply this 0xF588 (negative 10.468, the two's complement of 0x0A78) by 0xFF4B, which is the two's complement of 0x00B5 (0.707), the answer should be 0x0766, or something close to it.
What I get, on the other hand, is 0x66D8.
Now here is where it gets interesting: if I store the negative of 0xB5 in two's complement in 32 bits, the product comes out as 0xFF5266D8, which I shift right by 8 bits and then truncate to 16 bits; the answer is 0x5266.
On the other hand, if I instead store my negative 10.468 in 32 bits, the product is 0xF58F66D8, which after shifting by 8 bits and truncating becomes 0x8F66.
But if I store both numbers in 32-bit format, only then do I get the correct result after shifting and truncation, which is 0x0766.
Why is this happening? I understand that loss of information is intrinsic when we go from 32 to 16 bits, but 0x07 is much different from 0x52. I would be absolutely grateful for a response.
Let’s look at just the integer representations. You have two 16-bit integers, x and y, and you form their 16-bit two’s complements. However, you keep these 16-bit complements in 32-bit objects. In 32 bits, what you have is 65536–x and 65536–y. (For example, you started with 0xa78, complemented it to make 0xfffff588, and discarded bits to get 0xf588. That equals 0x10000-0xa78.)
When you multiply these, the result is 65536•65536 – 65536•x – 65536•y + x•y.
The 65536•65536 is 2^32, so it vanishes because unsigned 32-bit arithmetic is performed modulo 2^32. You are left with −65536•x − 65536•y + x•y.
Now you can see the problem: x•y is the product of two 16-bit values, so it extends into the high 16 bits of the 32-bit result. Up there, it is mixed with −65536•x − 65536•y, which you do not want.
An easy way to fix this is to keep all 32 bits of the complement when you multiply. E.g., when you took the two's complement of 0xa78, you got 0xfffff588, and then you discarded the high bits, keeping only 0xf588. If you do not discard them, you will multiply 0xfffff588 by 0xffffff4b, and the product will be 0x766d8, which, when shifted for the fraction, will be 0x766, the result you want.
If the high bits are lost because you stored the two’s complement into a 16-bit object, then simply restore them when you reload the object, by extending the sign bit. That is, take bit 15 and repeat it in bits 16 to 31. An easy way to do this is to load the 16-bit object into a 16-bit signed integer, then convert the 16-bit signed integer to an unsigned 32-bit integer.
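For concreteness, here is a minimal sketch of that sign-extend-then-multiply approach in C, assuming 8.8 values kept in 16-bit storage on a two's-complement implementation where right shifts of signed values are arithmetic; the function and variable names are mine, not from the question:

#include <stdint.h>
#include <stdio.h>

/* Multiply two 8.8 fixed-point values that were stored as 16-bit bit patterns.
   Casting to int16_t before widening restores the sign in bits 16 to 31,
   so the full 32-bit product comes out right. */
int16_t fixmul_8_8(uint16_t a, uint16_t b)
{
    int32_t wide_a = (int16_t)a;        /* 0xF588 becomes 0xFFFFF588 */
    int32_t wide_b = (int16_t)b;        /* 0xFF4B becomes 0xFFFFFF4B */
    int32_t product = wide_a * wide_b;  /* 0x000766D8 for the example values */
    return (int16_t)(product >> 8);     /* drop the extra fraction byte: 0x0766 */
}

int main(void)
{
    printf("%04X\n", (unsigned)(uint16_t)fixmul_8_8(0xF588, 0xFF4B)); /* prints 0766 */
    return 0;
}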
I am working in a 32-bit C programming environment without any built-in native 64-bit support, although some parts of the operating system APIs do support 64-bit numbers. I have to represent a negative value (range -8192 up to and including -1). Presumably I have to do the maths myself.
typedef struct _LONGLONG { ULONG lo; LONG hi; } LONGLONG;
LONGLONG x;
How can I assign such a negative value to x.lo and x.hi, and (less important) perhaps how can I verify it's the right number?
I've read about "What's wrong with quadpart?", but apparently there's no such thing available.
Bits represent values only according to some type scheme. The question does not state what scheme is used to make the bits in lo and hi represent values. For the purposes of this answer, we will suppose:
LONG hi is a 32-bit two’s complement integer.
ULONG lo is a 32-bit unsigned integer.
The API using this LONGLONG structure interprets lo and hi as a 64-bit two’s complement number formed with the bits of hi as the high 32 bits of the 64-bit number and the bits of lo as the low 32 bits.
In this case, if n is a value in [−8192, −1], the upper 32 bits of a 64-bit two’s complement representation are all ones, and the lower 32 bits are the same as the 32 bits of a 32-bit two’s complement representation of n. Therefore, the desired bit patterns of the LONGLONG x can be set with:
x.hi = -1;
x.lo = n;
Note that, since x.lo is unsigned, the assignment x.lo = n will convert the signed n to unsigned. This conversion will be done by wrapping modulo 2^32, and the resulting 32-bit unsigned value for x.lo has the same bit pattern as the 32-bit two’s complement value for n, so the desired result is achieved.
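A minimal sketch along those lines, assuming LONG and ULONG are the platform's 32-bit signed and unsigned typedefs (the stand-ins below are mine), with a rough printout to check the bit patterns:

#include <stdint.h>
#include <stdio.h>

typedef int32_t  LONG;   /* stand-in for the platform's 32-bit signed typedef   */
typedef uint32_t ULONG;  /* stand-in for the platform's 32-bit unsigned typedef */

typedef struct _LONGLONG { ULONG lo; LONG hi; } LONGLONG;

int main(void)
{
    LONG n = -8192;          /* any value in [-8192, -1] */
    LONGLONG x;

    x.hi = -1;               /* all ones: the sign extension of a small negative */
    x.lo = (ULONG)n;         /* wraps modulo 2^32, keeping the two's-complement bits */

    /* For -8192 this prints hi = 0xFFFFFFFF, lo = 0xFFFFE000,
       which is the 64-bit two's-complement pattern for -8192. */
    printf("hi = 0x%08X, lo = 0x%08X\n", (unsigned)x.hi, (unsigned)x.lo);
    return 0;
}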
I am learning about data allocation and am a little confused.
If you are looking for the smallest or greatest value that can be stored in a certain number of bits then does it matter what the data type is?
Wouldn't the smallest or biggest number that could be stored in 22 bits be 22 1's, positive or negative? Is the first part of this question a red herring? Wouldn't the smallest value be -4194303?
A 22-bit data element can store any one of 2^22 distinct values. What those values actually mean is a matter of interpretation. That interpretation may be imposed by a compiler or some piece of hardware, or may be under the control of the programmer, and suit some specific application.
A simple interpretation, of course, would be to treat the 22 bits as an unsigned integer, with values from 0 to (2^22)-1. A two's-complement, signed integer is a slightly more sophisticated interpretation of the same bits. Or you (or the compiler, or CPU) could divide the 22 bits up into a mantissa and exponent, and store a range of decimal numbers. The range and precision would depend on how many bits were allocated to the mantissa, and how many to the exponent.
Or you could split the bits up and use some for the numerator and some for the denominator of a fraction. Or, in fact, anything else.
Some of these interpretations of the bits are built into hardware, some are implemented by compilers or libraries, and some are entirely under the programmer's control. Not all programming languages allow the programmer to manipulate individual bits in a natural or efficient way, but some do. Sometimes, using a highly unconventional interpretation of binary data can give significant efficiency gains, but usually at the expense of readability and maintainability.
So, yes, it matters what the data type is.
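As a small illustration (a sketch, with helper names of my own), here is the same 22-bit pattern read two different ways in C, once as unsigned and once as two's complement:

#include <stdint.h>
#include <stdio.h>

/* Read a 22-bit field as an unsigned integer: 0 .. 2^22 - 1. */
static uint32_t as_unsigned22(uint32_t bits)
{
    return bits & 0x3FFFFF;
}

/* Read the same 22 bits as two's complement: -2^21 .. 2^21 - 1.
   If bit 21 is set, the pattern stands for its value minus 2^22. */
static int32_t as_signed22(uint32_t bits)
{
    uint32_t v = bits & 0x3FFFFF;
    return (v & 0x200000) ? (int32_t)v - 0x400000 : (int32_t)v;
}

int main(void)
{
    uint32_t all_ones = 0x3FFFFF;                          /* twenty-two 1 bits */
    printf("%u %d\n", (unsigned)as_unsigned22(all_ones),   /* 4194303 */
                      (int)as_signed22(all_ones));         /* -1      */
    return 0;
}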
There is no law (of humans, logic, or nature) that says bits must represent numbers only in the pattern that one of the bits represents 2^0, another represents 2^1, another represents 2^2, and so on (and the number represented is the sum of those values for the bits that are 1). We have choices about how to use bits to represent numbers, including:
The bits do use that pattern, and so 22 bits can represent any number from 0 to the sum of 2^0 + 2^1 + 2^2 + … + 2^21 = 2^22 − 1 = 4,194,303. The smallest representable value is 0.
The bits mostly use that pattern, but it is modified so that one bit represents −2^21 instead of +2^21. This is called two’s complement, and the smallest value representable is −2^21 = −2,097,152.
The bits represent numbers as described above, except the represented value is divided by 1000. This is called fixed-point. In the first case, the value represented by all bits 1 would be 4194.303, but the smallest representable value would still be 0. With a combination of two’s complement and fixed-point scaled by 1/1000, the smallest representable value would be −2097.152.
The bits represent a floating-point number, where one bit represents a sign (+ or −), certain bits represent an exponent and other information, and the remaining bits represent a significand. In common floating-point formats, when all the bits in that exponent-and-other field are 1s and the significand field bits are 0s, the number represents +∞ or −∞, according to the sign bit. In such a format, the smallest representable value is −∞.
As an example, we could designate patterns of bits to represent numbers arbitrarily. We could say that 0000000000000000000000 represents 34, 0000000000000000000001 represents −15, 0000000000000000000010 represents 5, 0000000000000000000011 represents 3+4i, and so on. The smallest representable value would be whichever of those arbitrary values is smallest.
So what the smallest representable value is depends entirely on the type, since the “type” of the data includes the scheme by which the bits represent values.
If the type is a “signed integer type,” there is still some flexibility in the representation. Most modern C implementations (and other programming languages) use the two’s complement scheme described above. But the C standard still allows two other schemes:
One’s complement: If the first bit is 1, the value represented is negative, and its magnitude is given by complementing the remaining bits and interpreting them as binary. Using six bits for an example, 101001 would be negative with the magnitude of binary 10110 = 22, so −22.
Sign-and-magnitude: If the first bit is 1, the value represented is negative, and its magnitude is given by interpreting the remaining bits as binary. Using the same bits, 101001 would be negative with the magnitude of binary 01001 = 9, so −9.
In both one’s complement and sign-and-magnitude, the smallest representable value with 22 bits is −(2^21 − 1) = −2,097,151.
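To make those worked examples concrete, here is a small sketch (the helper names are mine) that decodes the six-bit pattern 101001 under each of the three schemes:

#include <stdio.h>

/* "bits" holds a 6-bit pattern in its low 6 bits. */
static int twos_complement6(unsigned bits)   /* the top bit is worth -32 */
{
    return (bits & 0x20) ? (int)(bits & 0x1F) - 32 : (int)(bits & 0x1F);
}

static int ones_complement6(unsigned bits)   /* negative: complement the other bits */
{
    return (bits & 0x20) ? -(int)(~bits & 0x1F) : (int)(bits & 0x1F);
}

static int sign_magnitude6(unsigned bits)    /* negative: the other bits are the magnitude */
{
    return (bits & 0x20) ? -(int)(bits & 0x1F) : (int)(bits & 0x1F);
}

int main(void)
{
    unsigned pattern = 0x29;   /* 101001 */
    printf("%d %d %d\n",
           twos_complement6(pattern),   /* -23 */
           ones_complement6(pattern),   /* -22 */
           sign_magnitude6(pattern));   /*  -9 */
    return 0;
}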
To stretch the question further, C defines standard integer types but allows implementations to extend the language. An implementation could define some “signed integer type” with an arbitrary scheme for representing numbers, as long as that scheme included a sign, to make the name correct.
Without going into technical jargon about doing maths with two's complement, I'll try to explain in easy words.
First you need to raise 2 to the power of the number of bits.
Let's take an example of an 8 bit type,
An unsigned 8-bit integer can store 2 ^ 8 = 256 values.
Since values are indexed starting from 0, so values range from 0 - 255.
Assuming you want to store signed values, you need to take half of that (simply divide it by 2):
256 / 2 = 128.
Remember we start from zero,
You might be thinking you can store -127 to 127, starting from zero on both sides.
Just know that there is only zero (there is nothing like +0 or -0),
so you start with zero to positive half. 0 to 127,
that leaves you with negative half starting from -1 to -128
Hence the range will be -128 to 127.
For a 22 bit signed integer you can do the math,
2 ^ 22 = 4,194,304
4194304 / 2 = 2,097,152
minus 1 on the positive side (to make room for zero),
so the range will be -2097152 to 2097151.
To answer your question,
-2097152 would be the smallest number you can store.
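A quick check of that arithmetic (a sketch, assuming a two's-complement signed type):

#include <stdio.h>

int main(void)
{
    int bits = 22;
    long total = 1L << bits;        /* 2^22 = 4194304 distinct bit patterns */
    long min   = -(total / 2);      /* -2097152: the negative half          */
    long max   = total / 2 - 1;     /*  2097151: zero takes one slot        */

    printf("%ld values, range %ld to %ld\n", total, min, max);
    return 0;
}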
Thanks everyone for the replies. I figured it out with the help of all of your info, but I will explain the answer to show exactly what gaps in my knowledge led to my misunderstanding.
The data type does matter in this question, because for signed data types the first bit is used to represent whether a binary number is positive or negative: 0111 = 7 and, reading the first bit purely as a sign, 1111 = -7 (in two's complement, 1111 would actually be -1).
signed int and unsigned int use the same number of bits, 32 bits. Since an unsigned int is unsigned, the first bit isn't used to represent positive or negative, so it can represent a larger number with that extra bit. 1111 converted to an unsigned int is 15, whereas with the signed int it was -7, since the leftmost bit represents the sign: 1 is negative and 0 is positive.
Now to answer "If a C signed integer type is stored in 22 bits, what is the smallest value it can store?":
If you convert the binary to decimal you get 1111111111111111111111 = 4194303.
This is the maximum value an unsigned 22-bit value could hold. Since our data type is signed, it has to use one less bit for the magnitude, because the first bit represents the sign. This gives us -2097152 as the smallest value.
Thanks again, everyone.
I have a sample set of 32-bit data in the format 24-bit, 2’s complement, MSB first. The data precision is 18 bits; unused bits are zeros.
I want to process the numbers in this sample set to find their average value.
However, I am not sure how to convert all the numbers to the same type and then use them for calculating the average.
One way is to shift all the numbers from the sample set right by 14 bits. This way I directly get the 18 useful bits (since it is 24-bit data with 18-bit precision, I extract only the 18 useful bits).
Then I can directly use these numbers to calculate their average.
Following is an example sample set of data:
0xF9AFC000
0xF9AFC000
0xF9AE4000
0xF9AE0000
0xF9AE0000
0xF9AD0000
0xF9AC8000
0xF9AC8000
0xF9AC4000
0xF9AB4000
0xF9AB8000
0xF9AB4000
0xF9AA4000
0xF9AA8000
0xF9A98000
0xF9A8C000
0xF9A8C000
0xF9A8C000
0xF9A88000
0xF9A84000
However, the 18-bit number still has the sign bit (MSB). This bit is not always set and might be 0 or 1 depending upon the data.
Should I just mask off the sign bit by ANDing all the numbers with 0x1FFFF and use them for calculating the average?
Or should I first convert them from 2's complement to ordinary integers by inverting the bits and adding 1?
Please suggest a proper way to extract and process "24-bit, 2’s complement, MSB first" number from a 32-bit number.
Thanks in advance!
Well, providing sample data isn't a complete spec, but let's look at
F9AFC000
It looks like the data are in the high order 3 bytes. That's a guess. If they're indeed 24 bits of 2's complement, then getting the true value into a 32 bit int is just
#include <stdint.h>

int32_t get_value_from_datum(uint32_t datum) {
    return (int32_t) datum >> 8;   /* arithmetic right shift sign-extends the 24-bit value */
}
On the sample, this will sign extend the high bit of the leading F. The result will be FFF9AFC0. As a 2's complement integer written in base 10, this is -413760.
Or perhaps you mean that the 18 bits of interest are fully left-justified in the 32-bit word. Then it's
int32_t get_value_from_datum(uint32_t datum) {
    return (int32_t) datum >> 14;   /* keep only the 18 left-justified data bits */
}
This results in -6465.
As I said in the comment, you need to more clearly explain the data format.
A precise spec is most easily shown as a picture of the 32-bit word, MSB to LSB, which identifies which 18 bits are the data bits.
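Assuming the first reading (data in the high three bytes, 24-bit two's complement), a sketch of the averaging step might look like this; the array and accumulator names are mine, and only the first few samples from the question are shown:

#include <stdint.h>
#include <stdio.h>

/* Same extraction as above: treat the high 3 bytes as a 24-bit two's-complement value. */
static int32_t get_value_from_datum(uint32_t datum)
{
    return (int32_t)datum >> 8;     /* arithmetic shift sign-extends on typical compilers */
}

int main(void)
{
    uint32_t samples[] = { 0xF9AFC000, 0xF9AFC000, 0xF9AE4000, 0xF9AE0000 };
    size_t n = sizeof samples / sizeof samples[0];

    int64_t sum = 0;                /* 64-bit accumulator so the sum cannot overflow */
    for (size_t i = 0; i < n; i++)
        sum += get_value_from_datum(samples[i]);

    printf("average = %f\n", (double)sum / (double)n);
    return 0;
}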
I'm currently researching how to display a float number (with write) and I'm facing something which is confusing me.
I found that floats are stored in 32 bits, with 1 bit for the sign, 7 bits for the exponent and the rest for the mantissa.
Where my trouble comes in is when I display FLT_MAX with printf: I get 340282346638528859811704183484516925440.000000 by simply doing
printf("%f\n", FLT_MAX)
This value is bigger than INT_MAX, bigger than LLONG_MAX; how can this many digits be stored in 32 bits? Is it really 32 bits, or is it system dependent? I'm on Ubuntu x86_64 GNU/Linux.
I can't understand how more than 10 digits (the length of INT_MAX) can be stored in the same number of bits.
I think the problem is linked, because I also have trouble with double, which gives me
printf("%lf", DBL_MAX);
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
It's making the mystery bigger!
Thanks for helping, hope I was clear.
Bits are merely physical things with two states. There is no inherent meaning to them. When we use the bits to represent an integer in binary, we interpret each bit as having a value, 1 for one bit, 2 for another, 4 for another, 8 for another, and so on. There is nothing in physics, logic, or law that requires us to give them this interpretation.
When we use the bits to represent a floating-point object, we give each bit a different meaning. One bit represents the sign. Eight bits contain an encoding of the exponent. 23 bits contain an encoding of the significand.
To figure out the meaning of the bits given the floating-point encoding scheme for numbers in the normal range, we interpret the exponent bits as a binary numeral, then subtract 127, then raise two to the resulting power. (For example, “10000011” is the binary numeral for 131, so it represents 2^4.) Then we take the significand bits and append them to “1.”, forming a binary numeral such as “1.00111110000000000000000”. We convert that numeral to a number (it is 159/128), and we multiply it by the power from the exponent (producing 159/8 in this example) and apply the sign.
Since the exponent can be large, the value represented can be very large. The software that converts floating-point numbers to characters for output such as “340282346638528859811704183484516925440.000000” performs these interpretations for you.
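If it helps to see those fields pulled apart, here is a short sketch that decodes FLT_MAX, assuming float is a 32-bit IEEE-754 binary32 on the platform (true on x86_64 Linux):

#include <float.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = FLT_MAX;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);               /* reinterpret the 32 bits */

    uint32_t sign        = bits >> 31;            /* 1 bit           */
    uint32_t exponent    = (bits >> 23) & 0xFF;   /* 8 bits, biased  */
    uint32_t significand = bits & 0x7FFFFF;       /* 23 bits         */

    /* FLT_MAX comes out as sign 0, exponent 254 (2^127), significand 0x7FFFFF. */
    printf("sign=%u exponent=%u (2^%d) significand=0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (int)exponent - 127, (unsigned)significand);
    printf("%f\n", f);
    return 0;
}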
What is the result of adding the binary numbers 01000001 and 11111111 on an 8 bit machine?
If we are supposed to interpret this with the rules of C (it is tagged as such), for the signed case there are three interpretations of these numbers possible, corresponding to the three sign representations that are allowed in C.
For the unsigned case the standard requires that unsigned arithmetic wraps silently. All computation is done modulo 256 in that case.
Integer overflow.
If the numbers are unsigned (i.e. modular), 01000000 (with modular 8-bit math, adding 11111111 is equal to subtracting 1).
If both values are unsigned, then the result is 320 in decimal. Both operands are promoted to int before the addition, and int is required by the standard to have at least 16 bits, even on an 8 bit machine. The question doesn't make any restrictions for the result.
Unless you want the "wrong result fast", the answer is 320.
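A tiny demonstration of that promotion, assuming a hosted C implementation (int is at least 16 bits, so 320 fits):

#include <stdio.h>

int main(void)
{
    unsigned char a = 0x41;            /* 01000001 = 65  */
    unsigned char b = 0xFF;            /* 11111111 = 255 */

    int wide = a + b;                  /* both operands are promoted to int: 320 */
    unsigned char narrow = a + b;      /* the same sum stored back in 8 bits: 64 */

    printf("%d %d\n", wide, narrow);   /* prints: 320 64 */
    return 0;
}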
Correctly adding two numbers (in whatever representation) anywhere (including 8-bit machines) results in a unique number that can be represented in a multitude of different ways.
320 can be represented as 320 (usual decimal (base-10) representation) or 101000000 (binary representation) or 231100 (factoradic), ...
I think you just add the numbers, then cut the overflowing bits (from the left).
If it's only 8 bits, the maximum you can have is 255 (1111 1111) if the value is unsigned, and 127 if it is signed (-128 being the lowest). Therefore, doing this addition will cause overflow, which wraps back to 0 and then keeps counting. Think of it as a car's odometer: if there can only be, say, 8 digits on the counter, and your counter is at 99 999 999 miles, adding one more makes the counter go back to 0.
If these are signed integers, they represent 65 and -1. Adding them will give 64.
If these are unsigned integers, they represent 65 and 255. Since the sum cannot be represented in 8 bits, the result will be 64 and the carry bit will be set.