Explaining of Computing square root of two fixed point fractions - c

I have found that piece of code on blackfin533, which has fract32 which is from -1, 1, its in the format 1.31.
I can't get why the pre-shifting is required for calculating the amplitude of a complex number (re, img). I know if you want to multiply 1.31 by 1.31 fractional format then you need to shift right 31 bits.
GO_coil_D[0].re, and GO_coil_D[0].im are two fract32.
I can't get what the following code is doing :
norm[0] = norm_fr1x32(GO_coil_D[0].re);
norm[1] = norm_fr1x32(GO_coil_D[0].im);
shift = (norm[0] < norm[1]) ? (norm[0] - 1) : (norm[1] - 1);
vectorFundamentalStored.im = shl_fr1x32(GO_coil_D[0].im,shift);
vectorFundamentalStored.re = shl_fr1x32(GO_coil_D[0].re,shift);
vectorFundamentalStored.im = mult_fr1x32x32(vectorFundamentalStored.im, vectorFundamentalStored.im);
vectorFundamentalStored.re = mult_fr1x32x32(vectorFundamentalStored.re, vectorFundamentalStored.re);
amplitudeFundamentalStored = sqrt_fr16(round_fr1x32(add_fr1x32(vectorFundamentalStored.re,vectorFundamentalStored.im))) << 16;
amplitudeFundamentalStored = shr_fr1x32(amplitudeFundamentalStored,shift);
round_fr1x32` (fract32 f1) fract16 Rounds the 32-bit fract to a 16-bit fract using biased rounding.
norm_fr1x32 norm_fr1x32 (fract32) int Returns the number of left shifts required to normalize the input variable so that it is either in the interval 0x40000000 to 0x7fffffff, or in the interval 0x80000000 to 0xc0000000. In other words, fract32 x; shl_fr1x32(x,norm_fr1x32(x)); returns a value in the range 0x40000000 to 0x7fffffff, or in the range 0x80000000 to 0xc0000000

1) If the most significant n bits of the fractional part are all '0' bits, and they are followed by a '1' bit, then n behaves like a floating point binary exponent of value n, and the remaining 31-n bits behave like the mantissa. Squaring the number doubles the number of leading '0' bits to 2*n and reduces the size of the mantissa to 31-2*n bits. This can lead to a loss of precision in the result of the squaring operation.
2) round_fr1x32 converts the 1.31 fraction to a 1.15 fraction, losing up to 16 more bits of precision.
Hopefully you can see that steps 1 and 2 can remove a lot of precision in the number. Pre-scaling the number reduces the number of leading '0' bits n as much as possible, resulting in less precision being lost at step 1. In fact, for one of the two numbers being squared and added, the number of leading '0' bits n will be zero, so squaring that number will still leave up to 31 bits of precision before it is added to the other number. (Step 2 will reduce that precision to 15 bits.)
Lastly, you are wrong about the result of multiplying two 1.31 fraction format numbers - the result needs to be shifted right by 31 bits, not 62 bits.
Worked example:
Let's say the real part is 3/1024 and the imaginary part is 4/1024 in decimal, so the absolute value should be 5/1024 by pythagoras.
With no pre-scaling, the binary fractions are re=0.0000000011₂, im=0.0000000100₂. Squaring them gives re²=0.00000000000000001001₂, im²=0.00000000000000010000₂. Adding the squares gives abs²=0.00000000000000011001₂. Rounding to 15 fractional bits gives abs²=0.000000000000001₂. Taking the square root gives abs=0.000000010110101₂. This differs from the exact result 0.0000000101₂ by 0.000000000010101₂.
When prescaling, both fractions are shifted left by 6 bits, giving sre=0.0011₂, sim=0.0100₂ (I used the prefix 's' to mean 'scaled'). Squaring them gives sre²=0.00001001₂, sim²=0.00010000₂. Adding the squares gives sabs²=0.00011001₂. Rounding to 15 fractional bits does not change the value. Taking the square root gives sabs=0.01010000₂. Converting that to 1.31 format and shifting right by 6 bits gives abs=0.0000000101₂ which is exactly correct (5/1024 in decimal).

Related

Problem of a simple float number serialization example

I am reading the Serialization section of a tutorial http://beej.us/guide/bgnet/html/#serialization .
And I am reviewing the code which Encode the number into a portable binary form.
#include <stdint.h>
uint32_t htonf(float f)
{
uint32_t p;
uint32_t sign;
if (f < 0) { sign = 1; f = -f; }
else { sign = 0; }
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
return p;
}
float ntohf(uint32_t p)
{
float f = ((p>>16)&0x7fff); // whole part
f += (p&0xffff) / 65536.0f; // fraction
if (((p>>31)&0x1) == 0x1) { f = -f; } // sign bit set
return f;
}
I ran into problems with this line p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign .
According to the original code comments, this line extracts the whole part and sign, and the next line deals with fraction part.
Then I found an image about how float is represented in memory and started the calculation by hand.
From Wikipedia Single-precision floating-point format:
So I then presumed that whole part == exponent part.
But this (uint32_t)f)&0x7fff)<<16) is getting the last 15bits of the fraction part, if based on the image above.
Now I get confused, where did I get wrong?
It's important to realize what this code is not. This code does not do anything with the individual bits of a float value. (If it did, it wouldn't be portable and machine-independent, as it claims to be.) And the "portable" string representation it creates is fixed point, not floating point.
For example, if we use this code to convert the number -123.125, we will get the binary result
10000000011110110010000000000000
or in hexadecimal
807b2000
Now, where did that number 10000000011110110010000000000000 come from? Let's break it up into its sign, whole number, and fractional parts:
1 000000001111011 0010000000000000
The sign bit is 1 because our original number was negative. 000000001111011 is the 15-bit binary representation of 123. And 0010000000000000 is 8192. Where did 8192 come from? Well, 8192 ÷ 65536 is 0.125, which was our fractional part. (More on this below.)
How did the code do this? Let's walk through it step by step.
(1) Extract sign. That's easy: it's the ordinary test if(f < 0).
(2) Extract whole-number part. That's also easy: We take our floating-point number f, and cast it to type unint32_t. When you convert a floating-point number to an integer in C, the behavior is pretty obvious: it throws away the fractional part and gives you the integer. So if f is 123.125, (uint32_t)f is 123.
(3) Extract fraction. Since we've already got the integer part, we can isolate the fraction by starting with the original floating-point number f, and subtracting the integer part. That is, 123.125 - 123 = 0.125. Then we multiply the fractional part by 65536, which is 216.
It may not be obvious why we multiplied by 65536 and not some other number. In one sense, it doesn't matter what number you use. The goal here is to take a fractional number f and turn it into two integers a and b such that we can recover the fractional number f again later (although perhaps approximately). The way we're going to recover the fractional number f again later is by computing
a + b / x
where x is, well, some other number. If we chose 1000 for x, we'd break 123.125 up into a and b values of 123 and 125. We're choosing 65536, or 216, for x because that lets us make maximal use of the 16 bits we've allocated for the fractional part in our representation. Since x is 65536, b has to be some number we can divide by 65536 in order to get 0.125. So since b / 65536 = 0.125, by simple algebra we have b = 0.125 * 65536. Make sense?
Anyway, let's now look at the actual code for performing steps 1, 2, and 3.
if (f < 0) { sign = 1; f = -f; }
Easy peasy. If f is negative, our sign bit will be 1, and we want the rest of the code to operate on the positive version of f.
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31);
As mentioned, the important part here is (uint32_t)f, which just grabs the integer (whole-number) part of f. The bitmask & 0x7fff extracts the low-order 15 bits of it, throwing anything else away. (This is since our "portable representation" only allocates 15 bits for the whole-number part, meaning that numbers greater than 215-1 or 32767 can't be represented). The shift << 16 moves it into the high half of the eventual unint32_t result, where it belongs. And then | (sign<<31) takes the sign bit and puts it in the high-order position where it belongs.
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
Here, (int)f recomputes the integer (whole-number) part of f, and then f - (int)f extracts the fraction. We multiply it by 65536, as explained above. There may still be a fractional part (even after the multiplication, that is), so we cast to (uint32_t) again to throw that away, retaining only the integer part. We can only handle 16 bits of fraction, so we extract those bits (discarding anything else) with & 0xffff, although this should be unnecessary since we started with a positive fractional number less than 1, and multiplied it by 65536, so we should end up with a positive number less than 65536, i.e. we shouldn't have a number that won't exactly fit in 16 bits. Finally, the p |= operation stuffs these 16 bits we've just computed into the low-order half of p, and we're done.
Addendum: It may still not be obvious where the number 65536 came from, and why that was used instead of 10000 or some other number. So let's review two key points: we're ultimately dealing with integers here. Also, in one sense, the number 65536 actually was pretty arbitrary.
At the end of the day, any bit pattern we're working with is "really" just an integer. It's not a character, or a floating-point number, or a pointer — it's just an integer. If it has N bits, it represents integers from 0 to 2N-1.
In the fixed-point representation we're using here, there are three subfields: a 1-bit sign, a 15-bit whole-number part, and a 16-bit fraction part.
The interpretation of the sign and whole-number parts is obvious. But the question is: how shall we represent a fraction using a 16-bit integer?
And the answer is, we pick a number, almost any number, to divide by. We can call this number a "scaling factor".
We really can pick almost any number. Suppose I chose the number 3467 as my scaling factor. Here is now I would then represent several different fractions as integers:
    ½ → 1734/3467 → 1734
    ⅓ → 1155/3467 → 1155
    0.125 → 433/3467 → 433
So my fractions ½, ⅓, and 0.125 are represented by the integers 1734, 1155, and 433. To recover my original fractions, I just divide by 3467:
    1734 → 1734 ÷ 3467 → 0.500144
    1155 → 1155 ÷ 3467 → 0.333141
    433 → 1734 ÷ 3467 → 0.124891
Obviously I wasn't able to recover my original fractions exactly, but I came pretty close.
The other thing to wonder about is, where does that number 3467 "live"? If you're just looking at the numbers 1734, 1155, and 433, how do you know you're supposed to divide them by 3467? And the answer is, you don't know, at least, not just by looking at them. 3567 would have to be part of the definition of my silly fractional number format; people would just have to know, because I said so, that they had to multiply by 3467 when constructing integers to represent fractions, and divide by 3467 when recovering the original fractions.
And the other thing to look at is what the implications are of choosing various different scaling factors. The first thing is that, since in the end we're going to be using a 16-bit integer for the fractional representation, we absolutely can't use a scaling factor any greater than 65536. If we did, sometimes we'd end up with an integer greater than 65535, and it wouldn't fit in 16 bits. For example, suppose we tried to use a scaling factor of 70000, and suppose we tried to represent the fraction 0.95. Now, 0.95 is equal to 66500/70000, so our integer would be 66500, but that doesn't fit in 16 bits.
On the other hand, it turns out that ideally we don't want to use a number less than 65536, either. The smaller a number we use, the more of our 16-bit fractional representation we'll waste. When I chose 3467 in my silly example a little earlier, that meant I would represent fractions from 0/3467 = 0.00000 and 1/3467 = 0.000288 up to 3466/3467 = 0.999711. But I'd never use any of the integers from 3467 through 65536. They'd be wasted, and by not using them, I'd unnecessarily limit the precision of the fractions I could represent.
The "best" (least wasteful) scaling factor to use is 65536, although there's one other consideration, namely, which fractions do you want to be able to represent exactly? When I used 3467 as my scaling factor, I couldn't represent any of my test numbers ½, ⅓, or 0.125 exactly. If we use 65536 as the scaling factor, it turns out that we can represent fractions involving small powers of two exactly — that is, halves, quarters, eights, sixteenths, etc. — but not any other fractions, and in particular not most of the decimal fractions like 0.1. If we wanted to be able to represent decimal fractions exactly, we would have to use a scaling factor that was a power of 10. The largest power of 10 that will fit in 16 bits is 10000, and that would indeed let us exactly represent decimal fractions as small as 0.00001, although we'd waste about 5/6 (or 85%) of our 16-bit fractional range.
So if we wanted to represent decimal fractions exactly, without wasting precision, the inescapable conclusion is that we should not have allocated 16 bits for our fraction field in the first place. Better choices would have been 10 bits (ideal scaling factor 1024, we'd use 1000, wasting only 2%) or 20 bits (ideal scaling factor 1048576, we'd use 1000000, wasting about 5%).
The relevant excerpts from the page are
The thing to do is to pack the data into a known format and send that over the wire for decoding. For example, to pack floats, here’s something quick and dirty with plenty of room for improvement
and
On the plus side, it’s small, simple, and fast. On the minus side, it’s not an efficient use of space and the range is severely restricted—try storing a number greater-than 32767 in there and it won’t be very happy! You can also see in the above example that the last couple decimal places are not correctly preserved.
The code is presented only as an example. It is really quick and dirty, because it packs and unpacks the float as a fixed point number with 16 bits for fractions, 15 bits for integer magnitude and one for sign. It is an example and does not attempt to map floats 1:1.
It is in fact rather incredibly stupid algorithm: It can map 1:1 all IEEE 754 float32s within magnitude range ~256...32767 without losing a bit of information, truncate the fractions in floats in range 0...255 to 16 bits, and fail spectacularly for any number >= 32768. And NaNs.
As for the endianness problem: for any protocol that does not work with integers >= 32 bits intrinsically, someone needs to decide how to again serialize these integers into the other format. For example in the Internet at lowest levels data consists of 8-bit octets.
There are 24 obvious ways mapping a 32-bit unsigned integer into 4 octets, of which 2 are now generally used, and some more historically. Of course there are a countably infinite (and exponentially sillier) ways of encoding them...

Understanding casts from integer to float

Could someone explain this weird looking output on a 32 bit machine?
#include <stdio.h>
int main() {
printf("16777217 as float is %.1f\n",(float)16777217);
printf("16777219 as float is %.1f\n",(float)16777219);
return 0;
}
Output
16777217 as float is 16777216.0
16777219 as float is 16777220.0
The weird thing is that 16777217 casts to a lower value and 16777219 casts to a higher value...
In the IEEE-754 basic 32-bit binary floating-point format, all integers from −16,777,216 to +16,777,216 are representable. From 16,777,216 to 33,554,432, only even integers are representable. Then, from 33,554,432 to 67,108,864, only multiples of four are representable. (Since the question does not necessitate discussion of which numbers are representable, I will omit explanation and just take this for granted.)
The most common default rounding mode is to round the exact mathematical result to the nearest representable value and, in case of a tie, to round to the representable value which has zero in the low bit of its significand.
16,777,217 is equidistant between the two representable values 16,777,216 and 16,777,218. These values are represented as 1000000000000000000000002•21 and 1000000000000000000000012•21. The former has 0 in the low bit of its significand, so it is chosen as the result.
16,777,219 is equidistant between the two representable values 16,777,218 and 16,777,220. These values are represented as 1000000000000000000000012•21 and 1000000000000000000000102•21. The latter has 0 in the low bit of its significand, so it is chosen as the result.
You may have heard of the concept of "precision", as in "this fractional representation has 3 digits of precision".
This is very easy to think about in a fixed-point representation. If I have, say, three digits of precision past the decimal, then I can exactly represent 1/2 = 0.5, and I can exactly represent 1/4 = 0.25, and I can exactly represent 1/8 = 0.125, but if I try to represent 1/16, I can not get 0.0625; I will either have to settle for 0.062 or 0.063.
But that's for fixed-point. The computer you're using uses floating-point, which is a lot like scientific notation. You get a certain number of significant digits total, not just digits to the right of the decimal point. For example, if you have 3 decimal digits worth of precision in a floating-point format, you can represent 0.123 but not 0.1234, and you can represent 0.0123 and 0.00123, but not 0.01234 or 0.001234. And if you have digits to the left of the decimal point, those take away away from the number you can use to the right of the decimal point. You can use 1.23 but not 1.234, and 12.3 but not 12.34, and 123.0 but not 123.4 or 123.anythingelse.
And -- you can probably see the pattern by now -- if you're using a floating-point format with only three significant digits, you can't represent all numbers greater than 999 perfectly accurately at all, even though they don't have a fractional part. You can represent 1230 but not 1234, and 12300 but not 12340.
So that's decimal floating-point formats. Your computer, on the other hand, uses a binary floating-point format, which ends up being somewhat trickier to think about. We don't have an exact number of decimal digits' worth of precision, and the numbers that can't be exactly represented don't end up being nice even multiples of 10 or 100.
In particular, type float on most machines has 24 binary bits worth of precision, which works out to 6-7 decimal digits' worth of precision. That's obviously not enough for numbers like 16777217.
So where did the numbers 16777216 and 16777220 come from? As Eric Postpischil has already explained, it ends up being because they're multiples of 2. If we look at the binary representations of nearby numbers, the pattern becomes clear:
16777208 111111111111111111111000
16777209 111111111111111111111001
16777210 111111111111111111111010
16777211 111111111111111111111011
16777212 111111111111111111111100
16777213 111111111111111111111101
16777214 111111111111111111111110
16777215 111111111111111111111111
16777216 1000000000000000000000000
16777218 1000000000000000000000010
16777220 1000000000000000000000100
16777215 is the biggest number that can be represented exactly in 24 bits. After that, you can represent only even numbers, because the low-order bit is the 25th, and essentially has to be 0.
Type float cannot hold that much significance. The significand can only hold 24 bits. Of those 23 are stored and the 24th is 1 and not stored, because the significand is normalised.
Please read this which says "Integers in [ − 16777216 , 16777216 ] can be exactly represented", but yours are out of that range.
Floating representation follows a method similar to what we use in everyday life and we call exponential representation. This is a number using a number of digits that we decide will suffice to realistically represent the value, we call it mantissa, or significant, that we will multiply to a base, or radix, value elevated to a power that we call exponent. In plain words:
num*base^exp
We generally use 10 as base, because we have 10 finger in our hands, so we are habit to numbers like 1e2, which is 100=1*10^2.
Of course we regret to use exponential representation for so small numbers, but we prefer to use it when acting on very large numbers, or, better, when our number has a number of digits that we consider enough to represent the entity we are valorizing.
The correct number of digits could be how many we can handle by mind, or what are required for an engineering application. When we decided how many digits we need we will not care anymore for how adherent to the real value will be the numeric representation we are going to handle. I.e. for a number like 123456.789e5 it is understood that adding up 99 unit we can tolerate the rounded representation and consider it acceptable anyway, if not we should change the representation and use a different one with appropriate number of digits as in 12345678900.
On a computer when you have to handle very large numbers, that couldn't fit in a standard integer, or when the you have to represent a real number (with decimal part) the right choice is a floating or double floating point representation. It uses the same layout we discussed above, but the base is 2 instead of 10. This because a computer can have only 2 fingers, the states 0 or 1. Se the formula we used before, to represent 100, become:
100100*2^0
That's still isn't the real floating point representation, but gives the idea. Now consider that in a computer the floating point format is standardized and for a standard float, as per IEE-754, it uses, as memory layout (we will see after why it is assumed 1 more bit for the mantissa), 23bits for the mantissa, 1bit for the sign and 8bits for the exponent biased by -127 (that simply means that it will range between -126 and +127 without the need for a sign bit, and the values 0x00 and 0xff reserved for special meaning).
Now consider using 0 as exponent, this means that the value 2^exponent=2^0=1 multiplied by mantissa give the same behavior of a 23bits integer. This imply that incrementing a count as in:
float f = 0;
while(1)
{
f +=1;
printf ("%f\n", f);
}
You will see that the printed value linearly increase by one until it saturates the 23bits and the exponent will become to grow.
If the base, or radix, of our floating point number would have been 10, we would see an increase each 10 loops for the first 100 (10^2) values, than an increase of 100 for the next 1000 (10^3) values and so on. You see that this corresponds to the *truncation** we have to make due to the limited number of available digits.
The same phenomenon will be observed when using the binary base, only the changes happens on powers of 2 interval.
What we discussed up to now is called the denormalized form of a floating point, what is normally used is the counterpart normalized. The latter simply means that there is a 24th bit, not stored, that is always 1. In plane words we wouldn't use an exponent of 0 for number less that 2^24, but we shift it (multiply by 2) up to the MSbit==1 reach the 24th bit, than the exponent is adjusted to such a negative value that force the conversion to shift back the number to its original value.
Remember the reserved value of the exponent we talked above? Well an exponent==0x00 means that we have a denormalized number. exponent==0xff indicate a nan (not-a-number) or +/-infinity if mantissa==0.
It should be clear now that when the number we express is beyond the 24bits of the significant (mantissa), we should expect approximation of the real value depending on how much far we are from 2^24.
Now the number you are using are just on the edge of 2^24=16,277,216 :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| = 16,277,215
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\______ _______/\_____________________ _______________________/
i v v
g exponent mantissa
n
Now increasing by 1 we have:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| = 16,277,216
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Note that we have triggered to 1 the 24th bit, but from now on we are above the 24 bit representation, and each possible further representation is in steps of 2^1=2. Simply advance by 2 or can represent only even numbers (multiples of 2^1=2). I.e. setting to 1 the Less Significant bit we have:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1| = 16,277,218
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Increasing again:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0| = 16,277,220
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
As you can see we cannot exactly represent 16,277,219. In your code:
// This will print 16777216, because 1 increment isn't enough to
// increase the significant that can express only intervals
// that are > 2^1
printf("16777217 as float is %.1f\n",(float)16777217);
// This will print 16777220, because an increment of 3 on
// the base 16777216=2^24 will trigger an exponent increase rounded
// to the closer exact representation
printf("16777219 as float is %.1f\n",(float)16777219);
As said above the choice of the numeric format must be appropriate for the usage, a floating point is only an approximate representation of a real number, and is definitively our duty to carefully use the right type.
In the case if we need more precision we could use a double, or an integer long long int.
Just for sake of completeness I would add few words on the approximate representation for irriducible numbers. This numbers are not divisible by a fraction of 2, so the representation in float format will always be not exact, and need to be rounded to the correct value during conversion to decimal representation.
For more details see:
https://en.wikipedia.org/wiki/IEEE_754
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
Online demo applets:
https://babbage.cs.qc.cuny.edu/IEEE-754/
https://evanw.github.io/float-toy/
https://www.h-schmidt.net/FloatConverter/IEEE754.html

How are results rounded in floating-point arithmetic?

I wrote this code that simply sums a list of n numbers, to practice with floating point arithmetic, and I don't understand this:
I am working with float, this means I have 7 digits of precision, therefore, if I do the operation 10002*10002=100040004, the result in data type float will be 100040000.000000, since I lost any digit beyond the 7th (the program still knows the exponent, as seen here).
If the input in this program is
3
10000
10001
10002
You will see that, however, when this program computes 30003*30003=900180009 we have 30003*30003=900180032.000000
I understand this 32 appears becasue I am working with float, and my goal is not to make the program more precise but understand why this is happening. Why is it 900180032.000000 and not 900180000.000000? Why does this decimal noise (32) appear in 30003*30003 and not in 10002*10002 even when the magnitude of the numbers are the same? Thank you for your time.
#include <stdio.h>
#include <math.h>
#define MAX_SIZE 200
int main()
{
int numbers[MAX_SIZE];
int i, N;
float sum=0;
float sumb=0;
float sumc=0;
printf("introduce n" );
scanf("%d", &N);
printf("write %d numbers:\n", N);
for(i=0; i<N; i++)
{
scanf("%d", &numbers[i]);
}
int r=0;
while (r<N){
sum=sum+numbers[r];
sumb=sumb+(numbers[r]*numbers[r]);
printf("sum is %f\n",sum);
printf("sumb is %f\n",sumb);
r++;
}
sumc=(sum*sum);
printf("sumc is %f\n",sumc);
}
As explained below, the computed result of multiplying 10,002 by 10,002 must be a multiple of eight, and the computed result of multiplying 30,003 by 30,003 must be a multiple of 64, due to the magnitudes of the numbers and the number of bits available for representing them. Although your question asks about “decimal noise,” there are no decimal digits involved here. The results are entirely due to rounding to multiples of powers of two. (Your C implementation appears to use the common IEEE 754 format for binary floating-point.)
When you multiply 10,002 by 10,002, the computed result must be a multiple of eight. I will explain why below. The mathematical result is 100,040,004. The nearest multiples of eight are 100,040,000 and 100,040,008. They are equally far from the exact result, and the rule used to break ties chooses the even multiple (100,040,000 is eight times 12,505,000, an even number, while 100,040,008 is eight times 12,505,001, an odd number).
Many C implementations use IEEE 754 32-bit basic binary floating-point for float. In this format, a number is represented as an integer M multiplied by a power of two 2e. The integer M must be less than 224 in magnitude. The exponent e may be from −149 to 104. These limits come from the numbers of bits used to represent the integer and the exponent.
So all float values in this format have the value M • 2e for some M and some e. There are no decimal digits in the format, just an integer multiplied by a power of two.
Consider the number 100,040,004. The biggest M we can use is 16,777,215 (224−1). That is not big enough that we can write 100,040,004 as M • 20. So we must increase the exponent. Even with 22, the biggest we can get is 16,777,215 • 22 = 67,108,860. So we must use 23. And that is why the computed result must be a multiple of eight, in this case.
So, to produce a result for 10,002•10,002 in float, the computer uses 12,505,000 • 23, which is 100,040,000.
In 30,003•30,003, the result must be a multiple of 64. The exact result is 900,180,009. 25 is not enough because 16,777,215•25 is 536,870,880. So we need 26, which is 64. The two nearest multiples of 64 are 900,179,968 and 900,180,032. In this case, the latter is closer (23 away versus 41 away), so it is chosen.
(While I have described the format as an integer times a power of two, it can also be described as a binary numeral with one binary digit before the radix point and 23 binary digits after it, with the exponent range adjusted to compensate. These are mathematically equivalent. The IEEE 754 standard uses the latter description. Textbooks may use the former description because it makes analyzing some of the numerical properties easier.)
Floating point arithmetic is done in binary, not in decimal.
Floats actually have 24 binary bits of precision, 1 of which is a sign bit and 23 of which are called significand bits. This converts to approximately 7 decimal digits of precision.
The number you're looking at, 900180032, is already 9 digits long and so it makes sense that the last two digits (the 32) might be wrong. The rounding like the arithmetic is done in binary, the reason for the difference in rounding can only be seen if you break things down into binary.
900180032 = 110101101001111010100001000000
900180000 = 110101101001111010100000100000
If you count from the first 1 to the last 1 in each of those numbers (the part I put in bold), that is how many significand bits it takes to store the number. 900180032 takes only 23 significand bits to store while 900180000 takes 24 significand bits which makes 900180000 an impossible number to store as floats only have 23 significand bits. 900180032 is the closest number to the correct answer, 900180009, that a float can store.
In the other example
100040000 = 101111101100111110101000000
100040004 = 101111101100111110101000100
The correct answer, 100040004 has 25 significand bits, too much for floats. The nearest number that has 23 or less significand bits is 10004000 which only has 21 significant bits.
For more on floating point arithmetic works, try here http://steve.hollasch.net/cgindex/coding/ieeefloat.html

Convert Fraction Part to Binary

I have a fraction part of a float number (5 of -10.5, for instance). It is a character since I extracted it from the input. How can I convert that character into a binary fraction part of that float number? (I am building a float input -> HEX output in 32bit IEEE784, I have already extracted binary representations of the sign, exponent and integer part of mantissa.)
I have been thinking about implement the algorithm of multiplying the fraction by two and taking the remainder, then repeating till it fill the mantissa, but I am not allowed to use any floating point operations in an assignment.
Examples:
User inputs -10.5. The program needs to take the fraction of the number (which is 5) and convert it to binary format (which is 1 (.1))
EDIT: I am limited by 16 bit register size thus I need a solution to operate on numbers > 2^16-1.
You basically have the right idea. Just double the decimal part and the carry becomes the number. Let's try converting .7 to binary as an example (.5 is too simple)
NUMBER REMAINDER CARRY
.7 --- .
1.4 .4 1
.8 .8 0
1.6 .6 1
1.2 .2 1
.4 .4 0
.8 .8 0
1.6 .6 1
1.2 .2 1
&ct.
So .7 in binary is .101100110.... 10.7 would just be 1010.101100110...
You can stop converting to binary once the remainder becomes 0, or once the total representation becomes 24 bits long due to the limits of IEEE float precision.
Note that the decimal itself is purely decorative, you aren't using floating point here.
NUMBER %10 /10
7 --- .
14 4 1
8 8 0
16 6 1
12 2 1
4 4 0
8 8 0
16 6 1
12 2 1
&ct.
In the case of .07 you would represent the remainder and carry by %100 and /100 respectively. In the case of .007 you would use %1000 and /1000.
In an IEEE-754 sense, there is no well-defined meaning for the fractional part of a floating-point number outside the context of the overall number. The bits that distinguish the value 10.5 from the value 10.0 are a smaller subset of the overall representation than are the bits that distinguish the value 0.0 from the value 0.5. When one talks about the "fraction" part of an IEEE-754 numeric representation, one normally is talking about all bits of the mantissa; for IEEE 754 single-precision binary format that's 24 bits, including one implied bit, regardless of the magnitude of the number.
UPDATE / REVISION:
Based on updates clarifying the question, I have replaced the bulk of my previous answer, which addressed the problem of programmatically extracting the bits of a float or double, as opposed to converting a decimal-format text representation to the bits of an IEEE-754 format representation. Parsing text is a tricky problem, as I also said in the original version of this answer.
I do nevertheless reiterate that I think it a mistake to separate the mantissa into separate pieces. It offers no particular advantage that I can see, but it introduces the problem of properly recombining the pieces later.
Here's a workable approach to the overall problem:
Determine the sign of the input number based on the presence or absence of a negative sign.
Form a bignum representation of the absolute value of the input number. For example, a base-1,000,000,000 representation, stored in an array of uint32_t. As long as the base is a power of 10, it is straightforward to parse your decimal-format input number into such a representation, and no rounding error is introduced by doing so.
Scale the bignum representation by the appropriate power of two to obtain a scaled value between 223 and 224 - 1. This can be done with only integer arithmetic. Do note that it may be possible that you'll have to scale down instead of up. Either way, this scaling can be performed via integer arithmetic alone. Remember which power of two (s) was used as the scale factor (i.e. the base-2 logarithm of the scale factor, not so much the scale factor itself; this can come easily out of the algorithm with no computation of transcendental functions required).
Round to unit precision.
You then have the sign bit from step 1, the bits of the mantissa / significand as the remaining significant bits of the bignum (remember that the most-significant one is implicit, not explicit, in the final IEEE format), and the exponent as 23 - s.
Note, too, that that approach can be extended very naturally to scientific notation.

Why does a C floating-point type modify the actual input of 125.1 to 125.099998 on output?

I wrote the following program:
#include<stdio.h>
int main(void)
{
float f;
printf("\nInput a floating-point no.: ");
scanf("%f",&f);
printf("\nOutput: %f\n",f);
return 0;
}
I am on Ubuntu and used GCC to compile the above program. Here is my sample run and output I want to inquire about:
Input a floating-point no.: 125.1
Output: 125.099998
Why does the precision change?
Because the number 125.1 is impossible to represent exactly with floating-point numbers. This happens in most programming languages. Use e.g. printf("%.1f", f); if you want to print the number with one decimal, but be warned: the number itself is not exactly equal to 125.1.
Thank you all for your answers. Although almost all of you helped me look in the right direction I could not understand the exact reason for this behavior. So I did a bit of research in addition to reading the pages you guys pointed me to. Here is my understanding for this behavior:
Single Precision Floating Point numbers typically use 4 bytes for storage on x86/x86-64 architectures. However not all 32 bits (4 bytes = 32 bits) are used to store the magnitude of the number.
For storing as a single precision floating type, the input stream is formatted in the following notation (somewhat similar to scientific notation):
(-1)^s x 1.m x 2^(e-127), where
s = sign of the number, range:{0,1} - takes up 1 bit
m = mantissa (fractional portion) of the number - takes up 23 bits
e = exponent of the number offset by 127, range:{0,..,255} - takes up 8 bits
and then stored in memory as
0th byte 1st byte 2nd byte 3rd byte
mmmmmmmm mmmmmmmm emmmmmmm seeeeeee
Therefore the decimal number 125.1 is first converted to binary form but limited to 24 bits so that the mantissa is represented by no more than 23 bits. After conversion to binary form:
125.1 = 1111101.00011001100110011
NOTE: 0.1 in decimal can be represented up to infinite bits in binary but the computer limits the representation to 17 bits so the complete representation does not exceed 24 bits.
Now converting it into the specified notation we get:
125.1 = 1.111101 00011001100110011 x 2^6
= (-1)^0 + 1.111101 00011001100110011 x 2^(133-127)
which implies
s = 0
m = 11110100011001100110011
e = 133 = 10000101
Therefore, 125.1 will be stored in memory as:
0th byte 1st byte 2nd byte 3rd byte
mmmmmmmm mmmmmmmm emmmmmmm seeeeeee
00110011 00110011 11111010 01000010
On being passed to the printf() function the output stream is generated by converting the binary form to the decimal form. The bytes are actually stored in reverse order (from the input stream) and hence read in this order:
3rd byte 2nd byte 1st byte 0th byte
seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
01000010 11111010 00110011 00110011
Next, it is converted into the specific notation for conversion
(-1)^0 + 1.111101 00011001100110011 x 2^(133-127)
On simplifying the above representation further:
= 1.111101 00011001100110011 x 2^6
= 1111101.00011001100110011
And finally converting it to decimal:
= 125.0999984741210938
but single precision floating point can represent only up to 6 decimal places, therefore the answer is rounded off to 125.099998.
Think about a fixed point representation first.
2^3=8 2^2=4 2^1=2 2^0=1 2^-1=1/2 2^-2=1/4 2^-3=1/8 2^-4=1/16
If we want to represent a fraction then we set the bits to the right of the point, so 5.5 is represented as 01011000.
But if we want to represent 5.6, there is not an exact fractional representation. The closest we can get is 01011001 == 5.5625
1/2 + 1/16 = 0.5625
2^-4 + 2^-1
Because its the closest representation of 125.1 , remember that single precision floating point are just 32 bits.
If I tell you to write 1/3 as decimal number down, you realize there a numbers which have no finite representation. .1 is the exact representation of 1/10 there this problem does not appear, BUT this is just in decimal representation. In binary representation .1 is one of those numbers that require infinite digits. As your number must be somehwere cut there is something lost.
No floating point numbers has an exact representation, they all have limited accuracy. When converting from a number in text to a float (with scanf or otherwise), you're in another world with different kinds of numbers, and precision may be lost. Same thing goes when converting from a float to a string: you decide on how many digits you want. You can't know "how many digits there are" in a float before converting to text or another format that can keep that information. This all has to do with how floats are stored:
significant_digits * baseexponent
The normal type used for floating point in C is double, not float. Your float is implicitly cast to a double, and because the float is less precise, the difference to the closest representable number to 125.1 is more apparent (and printf's default precision is tailored for use with doubles). Try this instead:
#include<stdio.h>
int main(void)
{
double f;
printf("\nInput a floating-point no.: ");
scanf("%lf",&f);
printf("\nOutput: %f\n",f);
return 0;
}

Resources