32-bit fixed point arithmetic 1x1 does not equal 1 - c

I'm implementing 32-bit signed integer fixed-point arithmetic. The scale is from 1 to -1, with INT32_MAX corresponding to 1. I'm not sure whether to make INT32_MIN or -INT32_MAX correspond to -1, but that's an aside for now.
I've made some operations to multiply and round, as follows:
#define mul(a, b) ((int64_t)(a) * (b))
#define round(x) (int32_t)((x + (1 << 30)) >> 31)
The product of two numbers can then be found using round(mul(a, b)).
The issue comes up when I check the identity.
The main problem is that 1x1 is not 1. It's INT32_MAX-1. That's obviously not desired as I would like bit-accuracy. I suppose this would affect other nearby numbers so the fix isn't a case of just adding 1 if the operands are both INT32_MAX.
Additionally, -1x-1 is not -1, 1x-1 is not -1, and -1x-1=-1. So none of the identities hold up.
Is there a simple fix to this, or is this just a symptom of using fixed point arithmetic?

In its general form, a fixed-point format represents a number x as an integer x•s. Commonly, s is a power of some base b, s = bp. For example, we might store a number of dollars x as x•100, so $3.45 might be stored as 345. Here we can easily see why this is called a “fixed-point” format: The stored number conceptually has decimal point inserted as a fixed position, in this case two to the left of the rightmost digit: “345” is conceptually “3.45”. (This may also be called a radix point rather than a decimal point, allowing for cases when the base b is not ten. And p specifies where the radix-point is inserted, p base-b digits from the right.)
If you make INT_MAX represent 1, then you are implicitly saying s = INT_MAX. (And, since INT_MAX is not a power of any other integer, we have b = INT_MAX and p = 1.) Then −1 must be represented by −1•INT_MAX = -INT_MAX. It would not be represented by INT_MIN (except in archaic C implementations where INT_MIN = -INT_MAX).
Given s = INT_MAX, shifting by 31 bits is not a correct way to implementation multiplication. Given two numbers x and y with representations a and b, the representation of xy is computed by multiplying the representations a and b and dividing by s:
a represents x, so a = xs.
b represents y, so b = ys.
Then ab/s = xsys/s = xys, and xys represents xy.
Shifting by 31 divides by 231, so that is not the same as dividing by INT_MAX. Also, division is generally slow in hardware. You may be better off choosing s = 230 instead of INT_MAX. Then you could shift by 30 bits.
When calculating ab/s, we often want to round. Adding ½s to the product before dividing is one method of rounding, but it is likely not what you want for negative products. You may want to consider adding −½s if the product is negative.

Related

What happens if we keep dividing float 1.0 by 2 until it reaches zero?

float f = 1.0;
while (f != 0.0) f = f / 2.0;
This loop runs 150 times using 32-bit precision. Why is that so? Is it getting rounded to zero?
In common C implementations, the IEEE-754 binary32 format is used for float. It is also called “single precision.” It is a binary based format where finite numbers are represented as ±f•2e, where f is a 24-bit binary numeral in [1, 2) and e is an integer in [−126, 127].
In this format, 1 is represented as +1.000000000000000000000002•20. Dividing that by 2 yields ½, which is represented as +1.000000000000000000000002•2−1. Dividing that by 2 yields +1.000000000000000000000002•2−2, then +1.000000000000000000000002•2−3, and so on until we reach +1.000000000000000000000002•2−126.
When that is divided by two, the mathematical result is +1.000000000000000000000002•2−127, but −127 is below the normal exponent range, [−126, 127]. Instead, the significand becomes denormalized; 2−127 is represented by +0.100000000000000000000002•2−126. Dividing that by 2 yields +0.010000000000000000000002•2−126, then +0.001000000000000000000002•2−126, +0.000100000000000000000002•2−126, and so on until we get to +0.000000000000000000000012•2−126.
At this point, we have done 149 divisions by 2; +0.000000000000000000000012•2−126 is 2−149.
When the next division is performed, the result would be 2−150, but that is not representable in this format. Even with the lowest non-zero significand, 0.000000000000000000000012, and the lowest exponent, −126, we cannot get to 2−150. The next lower representable number is +0.000000000000000000000002•2−126, which equals 0.
So, the real-number-arithmetic result of the division would be 2−150, but we cannot represent that in this format. The two nearest representable numbers are +0.000000000000000000000012•2−126 just above it and +0.000000000000000000000002•2−126 just below it. They are equally near 2−150. The default rounding method is to take the nearest representable number and, in case of ties, to take the number with the even low digit. So +0.000000000000000000000002•2−126 wins the tie, and that is produced as the result for the 150th division.
What happens is simply that your system has only a limited number of bits available for a variable, and hence limited precision; even though, mathematically, you can halve a number (!= 0) indefinitely without ever reaching zero, in a computer implementation that has a limited precision for a float variable, that variable will inevitably, at some stage, become indistinguishable from zero. The more bits your system uses, the more precision it has and the later this will happen, but at some stage it will.
Since I suppose this is meant to be C, I just implemented it in C (with a counter counting each iteration), and indeed it ran for 150 rounds until the loop ended. I also implemented it with a double, where it ran for 1075 iterations. Keep in mind, however, that the C standard does not define the exact precision of a float variable. In most implementations it's 32 bits for a float and 64 for a double. With a long double, I get 16,446 iterations.

Problem of a simple float number serialization example

I am reading the Serialization section of a tutorial http://beej.us/guide/bgnet/html/#serialization .
And I am reviewing the code which Encode the number into a portable binary form.
#include <stdint.h>
uint32_t htonf(float f)
{
uint32_t p;
uint32_t sign;
if (f < 0) { sign = 1; f = -f; }
else { sign = 0; }
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
return p;
}
float ntohf(uint32_t p)
{
float f = ((p>>16)&0x7fff); // whole part
f += (p&0xffff) / 65536.0f; // fraction
if (((p>>31)&0x1) == 0x1) { f = -f; } // sign bit set
return f;
}
I ran into problems with this line p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign .
According to the original code comments, this line extracts the whole part and sign, and the next line deals with fraction part.
Then I found an image about how float is represented in memory and started the calculation by hand.
From Wikipedia Single-precision floating-point format:
So I then presumed that whole part == exponent part.
But this (uint32_t)f)&0x7fff)<<16) is getting the last 15bits of the fraction part, if based on the image above.
Now I get confused, where did I get wrong?
It's important to realize what this code is not. This code does not do anything with the individual bits of a float value. (If it did, it wouldn't be portable and machine-independent, as it claims to be.) And the "portable" string representation it creates is fixed point, not floating point.
For example, if we use this code to convert the number -123.125, we will get the binary result
10000000011110110010000000000000
or in hexadecimal
807b2000
Now, where did that number 10000000011110110010000000000000 come from? Let's break it up into its sign, whole number, and fractional parts:
1 000000001111011 0010000000000000
The sign bit is 1 because our original number was negative. 000000001111011 is the 15-bit binary representation of 123. And 0010000000000000 is 8192. Where did 8192 come from? Well, 8192 ÷ 65536 is 0.125, which was our fractional part. (More on this below.)
How did the code do this? Let's walk through it step by step.
(1) Extract sign. That's easy: it's the ordinary test if(f < 0).
(2) Extract whole-number part. That's also easy: We take our floating-point number f, and cast it to type unint32_t. When you convert a floating-point number to an integer in C, the behavior is pretty obvious: it throws away the fractional part and gives you the integer. So if f is 123.125, (uint32_t)f is 123.
(3) Extract fraction. Since we've already got the integer part, we can isolate the fraction by starting with the original floating-point number f, and subtracting the integer part. That is, 123.125 - 123 = 0.125. Then we multiply the fractional part by 65536, which is 216.
It may not be obvious why we multiplied by 65536 and not some other number. In one sense, it doesn't matter what number you use. The goal here is to take a fractional number f and turn it into two integers a and b such that we can recover the fractional number f again later (although perhaps approximately). The way we're going to recover the fractional number f again later is by computing
a + b / x
where x is, well, some other number. If we chose 1000 for x, we'd break 123.125 up into a and b values of 123 and 125. We're choosing 65536, or 216, for x because that lets us make maximal use of the 16 bits we've allocated for the fractional part in our representation. Since x is 65536, b has to be some number we can divide by 65536 in order to get 0.125. So since b / 65536 = 0.125, by simple algebra we have b = 0.125 * 65536. Make sense?
Anyway, let's now look at the actual code for performing steps 1, 2, and 3.
if (f < 0) { sign = 1; f = -f; }
Easy peasy. If f is negative, our sign bit will be 1, and we want the rest of the code to operate on the positive version of f.
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31);
As mentioned, the important part here is (uint32_t)f, which just grabs the integer (whole-number) part of f. The bitmask & 0x7fff extracts the low-order 15 bits of it, throwing anything else away. (This is since our "portable representation" only allocates 15 bits for the whole-number part, meaning that numbers greater than 215-1 or 32767 can't be represented). The shift << 16 moves it into the high half of the eventual unint32_t result, where it belongs. And then | (sign<<31) takes the sign bit and puts it in the high-order position where it belongs.
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
Here, (int)f recomputes the integer (whole-number) part of f, and then f - (int)f extracts the fraction. We multiply it by 65536, as explained above. There may still be a fractional part (even after the multiplication, that is), so we cast to (uint32_t) again to throw that away, retaining only the integer part. We can only handle 16 bits of fraction, so we extract those bits (discarding anything else) with & 0xffff, although this should be unnecessary since we started with a positive fractional number less than 1, and multiplied it by 65536, so we should end up with a positive number less than 65536, i.e. we shouldn't have a number that won't exactly fit in 16 bits. Finally, the p |= operation stuffs these 16 bits we've just computed into the low-order half of p, and we're done.
Addendum: It may still not be obvious where the number 65536 came from, and why that was used instead of 10000 or some other number. So let's review two key points: we're ultimately dealing with integers here. Also, in one sense, the number 65536 actually was pretty arbitrary.
At the end of the day, any bit pattern we're working with is "really" just an integer. It's not a character, or a floating-point number, or a pointer — it's just an integer. If it has N bits, it represents integers from 0 to 2N-1.
In the fixed-point representation we're using here, there are three subfields: a 1-bit sign, a 15-bit whole-number part, and a 16-bit fraction part.
The interpretation of the sign and whole-number parts is obvious. But the question is: how shall we represent a fraction using a 16-bit integer?
And the answer is, we pick a number, almost any number, to divide by. We can call this number a "scaling factor".
We really can pick almost any number. Suppose I chose the number 3467 as my scaling factor. Here is now I would then represent several different fractions as integers:
    ½ → 1734/3467 → 1734
    ⅓ → 1155/3467 → 1155
    0.125 → 433/3467 → 433
So my fractions ½, ⅓, and 0.125 are represented by the integers 1734, 1155, and 433. To recover my original fractions, I just divide by 3467:
    1734 → 1734 ÷ 3467 → 0.500144
    1155 → 1155 ÷ 3467 → 0.333141
    433 → 1734 ÷ 3467 → 0.124891
Obviously I wasn't able to recover my original fractions exactly, but I came pretty close.
The other thing to wonder about is, where does that number 3467 "live"? If you're just looking at the numbers 1734, 1155, and 433, how do you know you're supposed to divide them by 3467? And the answer is, you don't know, at least, not just by looking at them. 3567 would have to be part of the definition of my silly fractional number format; people would just have to know, because I said so, that they had to multiply by 3467 when constructing integers to represent fractions, and divide by 3467 when recovering the original fractions.
And the other thing to look at is what the implications are of choosing various different scaling factors. The first thing is that, since in the end we're going to be using a 16-bit integer for the fractional representation, we absolutely can't use a scaling factor any greater than 65536. If we did, sometimes we'd end up with an integer greater than 65535, and it wouldn't fit in 16 bits. For example, suppose we tried to use a scaling factor of 70000, and suppose we tried to represent the fraction 0.95. Now, 0.95 is equal to 66500/70000, so our integer would be 66500, but that doesn't fit in 16 bits.
On the other hand, it turns out that ideally we don't want to use a number less than 65536, either. The smaller a number we use, the more of our 16-bit fractional representation we'll waste. When I chose 3467 in my silly example a little earlier, that meant I would represent fractions from 0/3467 = 0.00000 and 1/3467 = 0.000288 up to 3466/3467 = 0.999711. But I'd never use any of the integers from 3467 through 65536. They'd be wasted, and by not using them, I'd unnecessarily limit the precision of the fractions I could represent.
The "best" (least wasteful) scaling factor to use is 65536, although there's one other consideration, namely, which fractions do you want to be able to represent exactly? When I used 3467 as my scaling factor, I couldn't represent any of my test numbers ½, ⅓, or 0.125 exactly. If we use 65536 as the scaling factor, it turns out that we can represent fractions involving small powers of two exactly — that is, halves, quarters, eights, sixteenths, etc. — but not any other fractions, and in particular not most of the decimal fractions like 0.1. If we wanted to be able to represent decimal fractions exactly, we would have to use a scaling factor that was a power of 10. The largest power of 10 that will fit in 16 bits is 10000, and that would indeed let us exactly represent decimal fractions as small as 0.00001, although we'd waste about 5/6 (or 85%) of our 16-bit fractional range.
So if we wanted to represent decimal fractions exactly, without wasting precision, the inescapable conclusion is that we should not have allocated 16 bits for our fraction field in the first place. Better choices would have been 10 bits (ideal scaling factor 1024, we'd use 1000, wasting only 2%) or 20 bits (ideal scaling factor 1048576, we'd use 1000000, wasting about 5%).
The relevant excerpts from the page are
The thing to do is to pack the data into a known format and send that over the wire for decoding. For example, to pack floats, here’s something quick and dirty with plenty of room for improvement
and
On the plus side, it’s small, simple, and fast. On the minus side, it’s not an efficient use of space and the range is severely restricted—try storing a number greater-than 32767 in there and it won’t be very happy! You can also see in the above example that the last couple decimal places are not correctly preserved.
The code is presented only as an example. It is really quick and dirty, because it packs and unpacks the float as a fixed point number with 16 bits for fractions, 15 bits for integer magnitude and one for sign. It is an example and does not attempt to map floats 1:1.
It is in fact rather incredibly stupid algorithm: It can map 1:1 all IEEE 754 float32s within magnitude range ~256...32767 without losing a bit of information, truncate the fractions in floats in range 0...255 to 16 bits, and fail spectacularly for any number >= 32768. And NaNs.
As for the endianness problem: for any protocol that does not work with integers >= 32 bits intrinsically, someone needs to decide how to again serialize these integers into the other format. For example in the Internet at lowest levels data consists of 8-bit octets.
There are 24 obvious ways mapping a 32-bit unsigned integer into 4 octets, of which 2 are now generally used, and some more historically. Of course there are a countably infinite (and exponentially sillier) ways of encoding them...

The max number of digits in an int based on number of bits

So, I needed a constant value to represent the max number of digits in an int, and it needed to be calculated at compile time to pass into the size of a char array.
To add some more detail: The compiler/machine I'm working with has a very limited subset of the C language, so none of the std libraries work as they have unsupported features. As such I cannot use INT_MIN/MAX as I can neither include them, nor are they defined.
I need a compile time expression that calculates the size. The formula I came up with is:
((sizeof(int) / 2) * 3 + sizeof(int)) + 2
It is marginally successful with n byte integers based on hand calculating it.
sizeof(int) INT_MAX characters formula
2 32767 5 7
4 2147483647 10 12
8 9223372036854775807 19 22
You're looking for a result related to a logarithm of the maximum value of the integer type in question (which logarithm depends on the radix of the representation whose digits you want to count). You cannot compute exact logarithms at compile time, but you can write macros that estimate them closely enough for your purposes, or that compute a close enough upper bound for your purposes. For example, see How to compute log with the preprocessor.
It is useful also to know that you can convert between logarithms in different bases by multiplying by appropriate constants. In particular, if you know the base-a logarithm of a number and you want the base-b logarithm, you can compute it as
logb(x) = loga(x) / loga(b)
Your case is a bit easier than the general one, though. For the dimension of an array that is not a variable-length array, you need an "integer constant expression". Furthermore, your result does not need more than two digits of precision (three if you wanted the number of binary digits) for any built-in integer type you'll find in a C implementation, and it seems like you need only a close enough upper bound.
Moreover, you get a head start from the sizeof operator, which can appear in integer constant expressions and which, when applied to an integer type, gives you an upper bound on the base-256 logarithm of values of that type (supposing that CHAR_BIT is 8). This estimate is very tight if every bit is a value bit, but signed integers have a sign bit, and they may have padding bits as well, so this bound is a bit loose for them.
If you want a a bound on the number of digits in a power-of-two radix then you can use sizeof pretty directly. Let's suppose, though, that you're looking for the number of decimal digits. Mathematically, the maximum number of digits in the decimal representation of an int is
N = ceil(log10(MAX_INT))
or
N = floor(log10(MAX_INT)) + 1
provided that MAX_INT is not a power of 10. Let's express that in terms of the base-256 logarithm:
N = floor( log256(MAX_INT) / log256(10) ) + 1
Now, log256(10) cannot be part of an integer constant expression, but it or its reciprocal can be pre-computed: 1 / log256(10) = 2.40824 (to a pretty good approximation; the actual value is slightly less). Now, let's use that to rewrite our expression:
N <= floor( sizeof(int) * 2.40824 ) + 1
That's not yet an integer constant expression, but it's close. This expression is an integer constant expression, and a good enough approximation to serve your purpose:
N = 241 * sizeof(int) / 100 + 1
Here are the results for various integer sizes:
sizeof(int) INT_MAX True N Computed N
1 127 3 3
2 32767 5 5
4 2147483648 10 10
8 ~9.223372037e+18 19 20
(The values in the INT_MAX and True N columns suppose one of the allowed forms of signed representation, and no padding bits; the former and maybe both will be smaller if the representation contains padding bits.)
I presume that in the unlikely event that you encounter a system with 8-byte ints, the extra one byte you provide for your digit array will not break you. The discrepancy arises from the difference between having (at most) 63 value bits in a signed 64-bit integer, and the formula accounting for 64 value bits in that case, with the result that sizeof(int) is a bit too much of an overestimation of the base-256 log of INT_MAX. The formula gives exact results for unsigned int up to at least size 8, provided there are no padding bits.
As a macro, then:
// Expands to an integer constant expression evaluating to a close upper bound
// on the number the number of decimal digits in a value expressible in the
// integer type given by the argument (if it is a type name) or the the integer
// type of the argument (if it is an expression). The meaning of the resulting
// expression is unspecified for other arguments.
#define DECIMAL_DIGITS_BOUND(t) (241 * sizeof(t) / 100 + 1)
An upper bound on the number of decimal digits an int may produce depends on INT_MIN.
// Mathematically
max_digits = ceil(log10(-INT_MAX))
It is easier to use the bit-width of the int as that approximates a log of -INT_MIN. sizeof(int)*CHAR_BIT - 1 is the max number of value bits in an int.
// Mathematically
max_digits = ceil((sizeof(int)*CHAR_BIT - 1)* log10(2))
// log10(2) --> ~ 0.30103
On rare machines, int has padding, so the above will over estimate.
For log10(2), which is about 0.30103, we could use 1/3 or one-third.
As a macro, perform integer math and add 1 for the ceiling
#include <stdlib.h>
#define INT_DIGIT10_WIDTH ((sizeof(int)*CHAR_BIT - 1)/3 + 1)
To account for a sign and null character add 2, use the following. With a very tight log10(2) fraction to not over calculate the buffer needs:
#define INT_STRING_SIZE ((sizeof(int)*CHAR_BIT - 1)*28/93 + 3)
Note 28/93 = ‭0.3010752... > log2(10)
The number of digits needed for any base down to base 2 would need follows below. It is interesting that +2 is needed and not +1. Consider a 2 bit signed number in base 2 could be "-10", a size of 4.
#define INT_STRING2_SIZE ((sizeof(int)*CHAR_BIT + 2)
Boringly, I think you need to hardcode this, centred around inspecting sizeof(int) and consulting your compiler documentation to see what kind of int you actually have. (All the C standard specifies is that it can't be smaller than a short, and needs to have a range of at least -32767 to +32767, and 1's complement, 2's complement, and signed magnitude can be chosen. The manner of storage is arbitrary although big and little endianness are common.) Note that an arbitrary number of padding bits are allowed, so you can't, in full generality, impute the number of decimal digits from the sizeof.
C doesn't support the level of compile time evaluable constant expressions you'd need for this.
So hardcode it and make your code intentionally brittle so that compilation fails if a compiler encounters a case that you have not thought of.
You could solve this in C++ using constexpr and metaprogramming techniques.
((sizeof(int) / 2) * 3 + sizeof(int)) + 2
is the formula I came up with.
The +2 is for the negative sign and the null terminator.
If we suppose that integral values are either 2, 4, or 8 bytes, and if we determine the respective digits to be 5, 10, 20, then a integer constant expression yielding the exact values could be written as follows:
const int digits = (sizeof(int)==8) ? 20 : ((sizeof(int)==4) ? 10 : 5);
int testArray[digits];
I hope that I did not miss something essential. I've tested this at file scope.

understanding Fixed point arithmetic

I am struggling with how to implement arithmetic on fixed-point numbers of different precision. I have read the paper by R. Yates, but I'm still lost. In what follows, I use Yates's notation, in which A(n,m) designates a signed fixed-point format with n integer bits, m fraction bits, and n + m + 1 bits overall.
Short question: How exactly is a A(a,b)*A(c,d) and A(a,b)+A(c,d) carried out when a != c and b != d?
Long question: In my FFT algorithm, I am generating a random signal having values between -10V and 10V signed input(in) which is scaled to A(15,16), and the twiddle factors (tw) are scaled to A(2,29). Both are stored as ints. Something like this:
float temp = (((float)rand() / (float)(RAND_MAX)) * (MAX_SIG - MIN_SIG)) + MIN_SIG;
int in_seq[i][j] = (int)(roundf(temp *(1 << numFracBits)));
And similarly for the twiddle factors.
Now I need to perform
res = a*tw
Questions:
a) how do I implement this?
b) Should the size of res be 64 bit?
c) can I make 'res' A(17,14) since I know the ranges of a and tw? if yes, should I be scaling a*tw by 2^14 to store correct value in res?
a + res
Questions:
a) How do I add these two numbers of different Q formats?
b) if not, how do I do this operation?
Maybe it's easiest to make an example.
Suppose you want to add two numbers, one in the format A(3, 5), and the other in the format A(2, 10).
You can do it by converting both numbers to a "common" format - that is, they should have the same number of bits in the fractional part.
A conservative way of doing that is to choose the greater number of bits. That is, convert the first number to A(3, 10) by shifting it 5 bits left. Then, add the second number.
The result of an addition has the range of the greater format, plus 1 bit. In my example, if you add A(3, 10) and A(2, 10), the result has the format A(4, 10).
I call this the "conservative" way because you cannot lose information - it guarantees that the result is representable in the fixed-point format, without losing precision. However, in practice, you will want to use smaller formats for your calculation results. To do that, consider these ideas:
You can use the less-accurate format as your common representation. In my example, you can convert the second number to A(2, 5) by shifting the integer right by 5 bits. This will lose precision, and usually this precision loss is not problematic, because you are going to add a less-precise number to it anyway.
You can use 1 fewer bit for the integer part of the result. In applications, it often happens that the result cannot be too big. In this case, you can allocate 1 fewer bit to represent it. You might want to check if the result is too big, and clamp it to the needed range.
Now, on multiplication.
It's possible to multiply two fixed-point numbers directly - they can be in any format. The format of the result is the "sum of the input formats" - all the parts added together - and add 1 to the integer part. In my example, multiplying A(3, 5) with A(2, 10) gives a number in the format A(6, 15). This is a conservative rule - the output format is able to store the result without loss of precision, but in applications, almost always you want to cut the precision of the output, because it's just too many bits.
In your case, where the number of bits for all numbers is 32, you probably want to lose precision in such a way that all intermediate results have 32 bits.
For example, multiplying A(17, 14) with A(2, 29) gives A(20, 43) - 64 bits required. You probably should cut 32 bits from it, and throw away the rest. What is the range of the result? If your twiddle factor is a number up to 4, the result is probably limited by 2^19 (the conservative number 20 above is needed to accommodate the edge case of multiplying -1 << 31 by -1 << 31 - it's almost always worth rejecting this edge-case).
So use A(19, 12) for your output format, i.e. remove 31 bits from the fractional part of your output.
So, instead of
res = a*tw;
you probably want
int64_t res_tmp = (int64_t)a * tw; // A(20, 43)
if (res_tmp == ((int64_t)1 << 62)) // you might want to neglect this edge case
--res_tmp; // A(19, 43)
int32_t res = (int32_t)(res_tmp >> 31); // A(19, 12)
Your question seems to assume that there is a single right way to perform the operations you are interested in, but you are explicitly asking about some of the details that direct how the operations should be performed. Perhaps this is the kernel of your confusion.
res = a*tw
a is represented as A(15,16) and tw is represented as A(2,29), so the its natural representation of their product A(18,45). You need more value bits (as many bits as the two factors have combined) to maintain full precision. A(18,45) is how you should interpret the result of widening your ints to a 64-bit signed integer type (e.g. int64_t) and computing their product.
If you don't actually need or want 45 bits of fraction, then you can indeed round that to A(18,13) (or to A(18+x,13-x) for any non-negative x) without changing the magnitude of the result. That does requiring scaling. I would probably implement it like this:
/*
* Computes a magnitude-preserving fixed-point product of any two signed
* fixed-point numbers with a combined 31 (or fewer) value bits. If x
* is represented as A(s,t) and y is represented as A(u,v),
* where s + t == u + v == 31, then the representation of the result is
* A(s + u + 1, t + v - 32).
*/
int32_t fixed_product(int32_t x, int32_t y) {
int64_t full_product = (int64_t) x * (int64_t) y;
int32_t truncated = full_product / (1U << 31);
int round_up = ((uint32_t) full_product) >> 31;
return truncated + round_up;
}
That avoids several potential issues and implementation-defined characteristics of signed integer arithmetic. It assumes that you want the results to be in a consistent format (that is, depending only on the formats of the inputs, not on their actual values), without overflowing.
a + res
Addition is actually a little harder if you cannot rely on the operands to initially have the same scale. You need to rescale so that they match before you can perform the addition. In the general case, you may not be able to do that without rounding away some precision.
In your case, you start with one A(15,16) and one A(18,13). You can compute an intermediate result in A(19,16) or wider (presumably A(47,16) in practice) that preserves magnitude without losing any precision, but if you want to represent that in 32 bits then the best you can do without risk of changing the magnitude is A(19,11). That would be this:
int32_t a_plus_res(int32_t a, int32_t res) {
int64_t res16 = ((int64_t) res) * (1 << 3);
int64_t sum16 = a + res16;
int round_up = (((uint32_t) sum16) >> 4) & 1;
return (int32_t) ((sum16 / (1 << 5)) + round_up);
}
A generic version would need to accept the scales of the operands' representations as additional arguments. Such a thing is possible, but the above is enough to chew on as it is.
All of the foregoing assumes that the fixed-point format for each operand and result is constant. That is more or less the distinguishing feature of fixed-point, differentiating it from floating-point formats on one hand and from arbitrary-precision formats on the other. You do, however, have the alternative of allowing formats to vary, and tracking them with a separate variable per value. That would be basically a hybrid of fixed-point and arbitrary-precision formats, and it would be messier.
Additionally, the foregoing assumes that overflow must be avoided at all costs. It would also be possible to instead put operands and results on a consistent scale; this would make addition simpler and multiplication more complicated, and it would afford the possibility of arithmetic overflow. That might nevertheless be acceptable if you have reason to believe that such overflow is unlikely for your particular data.

Double value as a negative zero

I have a program example:
int main()
{
double x;
x=-0.000000;
if(x<0)
{
printf("x is less");
}
else
{
printf("x is greater");
}
}
Why does the control goes in the first statement - x is less . What is -0.000000?
IEEE 754 defines a standard floating point numbers, which is very commonly used. You can see its structure here:
Finite numbers, which may be either base 2 (binary) or base 10
(decimal). Each finite number is described by three integers: s = a
sign (zero or one), c = a significand (or 'coefficient'), q = an
exponent. The numerical value of a finite number is
(−1)^s × c × bq
where b is the base (2 or 10). For example, if the sign is 1
(indicating negative), the significand is 12345, the exponent is −3,
and the base is 10, then the value of the number is −12.345.
So if the fraction is 0, and the sign is 0, you have +0.0.
And if the fraction is 0, and the sign is 1, you have -0.0.
The numbers have the same value, but they differ in the positive/negative check. This means, for instance, that if:
x = +0.0;
y = -0.0;
Then you should see:
(x -y) == 0
However, for x, the OP's code would go with "x is greater", while for y, it would go with "x is less".
Edit: Artur's answer and Jeffrey Sax's comment to this answer clarify that the difference in the test for x < 0 in the OP's question is actually a compiler optimization, and that actually the test for x < 0 for both positive and negative 0 should always be false.
Negative zero is still zero, so +0 == -0 and -0 < +0 is false. They are two representations of the same value. There are only a few operations for which it makes a difference:
1 / -0 = -infinity, while 1 / +0 = +infinity.
sqrt(-0) = -0, while sqrt(+0) = +0
Negative zero can be created in a few different ways:
Dividing a positive number by -infinity, or a negative number by +infinity.
An operation that produces an underflow on a negative number.
This may seem rather obscure, but there is a good reason for this, mainly to do with making mathematical expressions involving complex numbers consistent. For example, note that the identity 1/√(-z)==-1/√z is not correct unless you define the square root as I did above.
If you want to know more details, try and find William Kahan's Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit in The State of the Art in Numerical Analysis (1987).
Nathan is right but there is one issue though. Usually most of float/double operations are performed by coprocessor. However some compilers try to be clever and instead of letting coprocessor do the comparison (it treats -0.0 and +0.0 the same as 0.0) just assume that since your x variable has minus sign it means that it should be treated as negative and optimize your code.
If you would be able to see how assembly output looks like - I bet you'll only see call to:
printf("x is less");
So it is optimization stuff (bad optimization).
BTW - VC 2008 produces correct output here regardless of optimization level set.
For example - VC optimizes (at full/max optimization level) the code leaving this only:
printf("x is grater");
I like my compiler more every day ;-)

Resources