Square root union and Bit Shift - c

I found this code that computes the square root. What surprises me is the way it does it, using a union and bit shifts. This is the code:
float sqrt3(const float x)
{
    union
    {
        int i;
        float x;
    } u;

    u.x = x;
    u.i = (1 << 29) + (u.i >> 1) - (1 << 22);
    return u.x;
}
First the value of x is saved in u.x, then a value is assigned to u.i, and then the square root of the number magically appears in u.x.
Can someone explain to me how this algorithm works?

The above code exhibits UB (undefined behaviour), so it should not be trusted to work on any platform. This is because it writes to one member of a union and reads back from a member other than the one it last wrote. It also depends heavily on endianness (the ordering of the bytes within a multi-byte integer).
However, it generally will do what is expected, and to understand why it is worthwhile for you to read about the IEEE 754 binary32 floating-point format.
Crash Course in IEEE754 binary32 format
IEEE754 commonly divides a 32-bit float into 1 sign bit, 8 exponent bits and 23 mantissa bits, thus giving
Bit #:    31       30 ...... 23     22 .................... 0
Bits:      s        eeeeeeee        mmmmmmmmmmmmmmmmmmmmmmm
Value:    sign  *  1.mantissa   *   pow(2, exponent - 127)
With the number essentially being in "scientific notation, base 2".
As a detail, the exponent is stored in a "biased" form (that is, it has a value 127 units too high). This is why we subtract 127 from the encoded exponent to get the "real" exponent.
Short Explanation
What your code does is halve the exponent portion and slightly mangle the mantissa. This is done because the square root of a number has an exponent roughly half in magnitude.
Example in base 10
Assume we want the square root of 4000000 = 4*10^6.
4000000 ~ 4*10^6 <- Exponent is 6
4000 ~ 4*10^3 <- Divide exponent in half
Just by dividing the exponent 6 by 2, getting 3, and making it the new exponent, we are already within the right order of magnitude and much closer to the truth,
2000 = sqrt(4000000)
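For reference, here is a minimal sketch of the same trick written with memcpy instead of the union, which sidesteps the type-punning concern mentioned above. It assumes a 32-bit int and IEEE 754 binary32 floats; the name sqrt_approx and the comparison against sqrtf are additions here, not part of the original code.
#include <stdio.h>
#include <string.h>
#include <math.h>

static float sqrt_approx(float x)
{
    int i;
    memcpy(&i, &x, sizeof i);               /* read the float's bit pattern    */
    i = (1 << 29) + (i >> 1) - (1 << 22);   /* same magic constant as the code */
    memcpy(&x, &i, sizeof x);               /* write the bits back as a float  */
    return x;
}

int main(void)
{
    printf("approx: %f  exact: %f\n", sqrt_approx(4000000.0f), sqrtf(4000000.0f));
    return 0;
}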

You can find a perfect explanation on wikipedia:
Methods of computing square roots
see section: Approximations that depend on the floating point representation
So for a 32-bit single precision floating point number in IEEE format (where notably, the power has a bias of 127 added for the represented form) you can get the approximate logarithm by interpreting its binary representation as a 32-bit integer, scaling it by 2^-23, and removing a bias of 127, i.e.
x_int * 2^-23 - 127 ≈ log2(x)
To get the square root, divide the logarithm by 2 and convert the value back.
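As an illustration (a sketch, not code from the article), the logarithm approximation can be written like this; it assumes IEEE 754 binary32 floats and a 32-bit unsigned int, and approx_log2 is a name made up here:
#include <stdio.h>
#include <string.h>
#include <math.h>

static float approx_log2(float x)
{
    unsigned int i;
    memcpy(&i, &x, sizeof i);               /* interpret the bits as an integer    */
    return i / (float)(1 << 23) - 127.0f;   /* scale by 2^-23, remove the 127 bias */
}

int main(void)
{
    float x = 4000000.0f;
    printf("approx log2 = %f   true log2 = %f\n", approx_log2(x), log2f(x));
    return 0;
}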

Related

If I have a float f = 50,000 and then I do f*f, is the value returned negative?

So, it's almost time for midterms and the professor gave us some sample questions.
What I THINK the answer is:
We are given a float that is f=50000.
if we do f*f we get 2,500,000,000.
Now, I'm assuming we're working with a 32-bit machine, as that is what we have studied so far. If that is the case, then 2,500,000,000 in a 32-bit float not declared unsigned is considered signed by default. Since 2,500,000,000 is a little over half of the 32-bit range of 4,294,967,296, and it is signed, we would have a negative value returned, so the statement f * f < 0 would be true, right?
I've only been studying systems programming for 4 weeks, PLEASE correct me if I am wrong here.
Unlike the int type, which is typically represented as a two's complement number, a float is a floating point type, which means it stores values using a mantissa and an exponent. This means that the typical wrapping behavior seen with signed integer types doesn't apply to floating point types.
In the case of 2,500,000,000, this will actually get stored as 0x1.2A05F2 x 2^31.
Floating point types are typically stored using IEEE 754 floating point format. In the case of a single precision floating point (which a float typically is), it has 1 sign bit, 8 exponent bits, and 24 mantissa bits (with 23 bits stored, as the high order "1" bit is implied).
While this format can't "wrap" from positive to negative, it is subject to 2 things:
Loss of precision
Overflow of the exponent
As an example of precision loss, let's use a decimal floating point format with a 3 digit mantissa and a 2 digit exponent. If we multiply 2.34 x 10^10 by 6.78 x 10^10, you get 1.58652 x 10^21, but because of the 3 digit precision it gets truncated to 1.58 x 10^21. So we lose the least significant digits.
To illustrate exponent overflow, suppose we were to multiply 2.00 x 10^60 by 3.00 x 10^50. You'd get 6.00 x 10^110. But because the maximum value of an exponent is 99, this is an overflow. IEEE 754 has a special notation for infinity which it uses in the case of overflow where it sets the mantissa to all 0 bits and the exponent to all 1 bits, and the sign bit can be used to distinguish positive infinity and negative infinity.
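A small program can confirm both points, assuming IEEE 754 binary32 floats: 50,000 squared stays positive (and exact, since it fits in 24 significant bits), while a deliberately huge product overflows the exponent and becomes infinity.
#include <stdio.h>

int main(void)
{
    float f = 50000.0f;
    float big = 1.0e30f;

    printf("f * f     = %f  (negative? %s)\n", f * f, (f * f < 0) ? "yes" : "no");
    printf("big * big = %f\n", big * big);   /* exponent overflow: prints inf */
    return 0;
}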

How is the range of long double in C calculated?

While studying C I came to know that the range of long double is 3.4E-4932 to 1.1E+4932. What is E here? The size of long double is 10 bytes. If I assume E is 10, then how does long double store numbers up to 19 places after the decimal point?
3.4E-4932 means 3.4 × 10^-4932. Both floats and doubles are stored in a format that keeps the exponent and the mantissa separate. In your example, -4932 will be encoded in the exponent, and 3.4 will be encoded in the mantissa, both as binary numbers.
Note that IEEE floating point formats come in a variety of ranges, with availability that varies by platform. Refer to IEEE floating point for more details. As pointed out by Joe Farrell, your range is probably the x86 Extended Precision Format. That format carries 1 bit for the sign (s), 15 bits of binary exponent (e) with a bias of -16383, and 1 + 63 bits of binary mantissa (m). For normalized numbers, the value is computed as (-1)^s * m * 2^(e - 16383).
The smallest positive normalized number in this format has a sign bit of 0, an exponent of 1, and a mantissa of 1.0, corresponding to 2^(1 - 16383) = 2^-16382, or about 3.36 × 10^-4932. In binary, that number looks like:
0 000000000000001 1000000000000000000000000000000000000000000000000000000000000000
The range of a long double (or, indeed, any floating point width) on Intel hardware is typically [-∞, ∞]. Between those endpoints many finite numbers are also representable:
0
±m × 2^e, where:
m is an integer between 1 and 2^64 - 1, and
e is an integer between -16445 and 16320
That means that the smallest non-zero long double is 2^-16445 and the largest finite long double is (2^64 - 1)·2^16320 (or 2^16384 - 2^16320), which are approximately equal to the decimal numbers in scientific notation in the question.
See this Wikipedia article for details on the representation (which is binary, not decimal).
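If you want to check those endpoints on your own machine, here is a quick sketch, assuming long double is the x86 80-bit extended format described above (other platforms will print different values):
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    printf("largest finite long double   = %Le\n", LDBL_MAX);                /* ~1.19e+4932 */
    printf("smallest normalized positive = %Le\n", LDBL_MIN);                /* ~3.36e-4932 */
    printf("smallest denormal positive   = %Le\n", nextafterl(0.0L, 1.0L));  /* 2^-16445    */
    return 0;
}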

Calculating range of float in C

Title pretty much sums it up.
I know that floats are 32 bits total, with 23 bits for the mantissa, 8 bits for the exponent, and 1 for the sign.
Calculating the range of "int" is pretty simple: 32 bits - 1 sign bit = 31 bits ==> the range is therefore 2³¹ ≈ 2.14e9
The formula makes sense...
Now I've looked around Stack Overflow, but all the answers I've found regarding float range calculations lacked substance: just a bunch of numbers appearing randomly in the responses and magically reaching the 3.4e38 conclusion.
I'm looking for an answer from someone with real knowledge of the subject, someone who can explain through the use of a formula how this range is calculated.
Thank you all.
Mo.
C does not define float exactly as described by the OP. The format the OP suggests, binary32, is the most popular, but it is only one of many conforming formats.
What C does define
5.2.4.2.2 Characteristics of floating types
s sign (±1)
b base or radix of exponent representation (an integer > 1)
e exponent (an integer between a minimum emin and a maximum emax)
p precision (the number of base-b digits in the significand)
f_k nonnegative integers less than b (the significand digits)
x = s*power(b,e)*Σ(k=1, p, f[k]*power(b,-k))
For binary32, the max value is
x = (+1)*power(2, 128)*(0.1111111111 1111111111 1111 binary)
x = 3.402...e+38
Given 32 bits to define a float, many other possibilities occur. Example: a float could exist just like binary32, yet not support infinity/not-a-number. That leaves another exponent value available for finite numbers. The max value is then 2*3.402...e+38.
binary32 describes its significand as ranging up to 1.11111... binary, while the C characteristic formula above ranges only up to 0.111111...; the difference is absorbed by the exponent, which is why the formula uses power(2, 128) rather than power(2, 127).
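For a concrete check, here is a small sketch that evaluates the C model's maximum for float from the <float.h> characteristics, assuming FLT_RADIX == 2 (ldexp is used only to build powers of two):
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    /* Largest model significand: 0.111...1 base 2, with FLT_MANT_DIG ones. */
    double max_significand = 1.0 - ldexp(1.0, -FLT_MANT_DIG);

    /* Largest exponent in the model: FLT_MAX_EXP (128 for binary32). */
    double max_value = max_significand * ldexp(1.0, FLT_MAX_EXP);

    printf("model maximum = %e\n", max_value);        /* 3.402823e+38 */
    printf("FLT_MAX       = %e\n", (double)FLT_MAX);
    return 0;
}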
C typically uses single-precision floating point for float, which means a 32-bit float has 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. The mantissa is calculated by summing each mantissa bit * 2^(-bit_index). The exponent is calculated by converting the 8-bit binary number to decimal and subtracting 127 (thus you can have negative exponents as well), and the sign bit indicates whether or not the number is negative. The formula is thus:
(-1)^S * 1.M * 2^(E - 127)
Where S is the sign, M is the mantissa, and E is the exponent. See https://en.wikipedia.org/wiki/Single-precision_floating-point_format for a better mathematical explanation.
To explicitly answer your question, that means for a 32-bit float the largest value is (-1)^0 * 1.99999988079071044921875 * 2^127, which is approximately 3.4028235 × 10^38. The smallest value is the negative of that.
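As a quick sanity check of that arithmetic (a sketch assuming IEEE 754 binary32; ldexp just builds the powers of two):
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    double max = (2.0 - ldexp(1.0, -23)) * ldexp(1.0, 127);  /* 1.111...1b * 2^127 */
    printf("(2 - 2^-23) * 2^127 = %e\n", max);               /* ~3.402823e+38 */
    printf("FLT_MAX             = %e\n", (double)FLT_MAX);
    return 0;
}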

C, getting the maximum float or maximum double not from <float.h>

I was working through the book "The C Programming Language" but faced a question in which I should get the maximum/minimum value of a floating-point number without using any standard library, such as <float.h>. Thank you.
“Without using” exercises are a little bit stupid, so here is one version “without using” any header.
…
double nextafter(double, double);
double max = nextafter(1.0 / 0.0, 0.0);
…
And without using any library function, only assuming that double is mapped to IEEE 754's binary64 format (a very common choice):
…
double max = 0x1.fffffffffffffp1023;
…
Assuming a binary floating-point format, start with 2.0 and multiply it by 2.0 until you get an overflow. This determines the maximum exponent. Then, starting with x as the number you had right before the overflow, take the sum x + x/2 + x/4 + ... until adding x/q does not change the value of the number (or overflows again). This determines the maximum mantissa.
The smallest representable positive number can be found in a similar way.
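Here is a rough sketch of that approach for double, assuming IEEE-style binary floating point where overflow produces infinity; it illustrates the idea rather than being a portable solution:
#include <stdio.h>

int main(void)
{
    const double inf = 1.0 / 0.0;     /* IEEE infinity, as in the answer above */

    /* Step 1: find the largest power of two by doubling until overflow. */
    double x = 2.0;
    while (x * 2.0 != inf)
        x *= 2.0;                     /* ends with x == 2^1023 for binary64 */

    /* Step 2: fill in the mantissa by adding x/2, x/4, ... while it still fits. */
    double max = x;
    for (double term = x / 2.0; term > 0.0; term /= 2.0) {
        double next = max + term;
        if (next != inf && next != max)
            max = next;
    }

    printf("computed max: %e\n", max);  /* should match DBL_MAX, ~1.797693e+308 */
    return 0;
}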
From wikipedia you can read up the IEEE floating point format: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This contains
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
The page also contains information on how to interpret the exponent value. A value of 0xFF (255) in the exponent signifies ±infinity if the significand is zero and NaN (not a number) otherwise. The ±infinities are the largest numbers; the sign bit defines whether the number is +infinity or -infinity. If the question is about the largest non-infinite value, then just use the largest non-special value.
The largest non-infinite value has 24 bits of 1s in the significand and 0xFE (254) as the exponent. Since the exponent is offset, the actual value is roughly significand * 2^(254-127), which comes out close to 3.402823 × 10^38 in decimal according to the Wikipedia page. If you want the minimum, just toggle the sign bit on to get the exact same value as negative.
EDIT: Since this is about C, I've assumed the 32 bit IEEE float.
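To make the bit layout concrete, a small sketch that builds those patterns by hand, assuming IEEE 754 binary32 and a 32-bit unsigned int (memcpy copies the bits into a float):
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int max_bits = (0xFEu << 23) | 0x7FFFFFu;  /* exponent 254, all-ones significand */
    unsigned int inf_bits = 0xFFu << 23;                /* exponent 255, zero significand     */
    float max_f, inf_f;

    memcpy(&max_f, &max_bits, sizeof max_f);
    memcpy(&inf_f, &inf_bits, sizeof inf_f);

    printf("largest finite float: %e\n", max_f);  /* ~3.402823e+38 */
    printf("+infinity:            %e\n", inf_f);  /* inf */
    return 0;
}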
You can figure out the number of bits the number holds by doing a sizeof(type)*8.
Then look at http://en.wikipedia.org/wiki/Double-precision_floating-point_format or http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This way you can look it up in a table based in the number of bits.
This assumes that the structure is using IEEE 754.
You could start from the IEEE definition and work from there: for example, the number of bits of exponent and the number of bits of mantissa. When you study the format, you will see that the 23 bits of mantissa actually represent 24 bits. The reason is that the mantissa is "normalised", that is, it is left-shifted so that the most significant bit is always 1. This gives the maximum number of significant bits retained from a calculation. Where has the 24th bit gone? Because it is always there (except for a 0 value), it is "implied" as the 24th bit.

problems in floating point comparison [duplicate]

#include <stdio.h>
#include <conio.h>   /* non-standard header providing getch() */

int main(void)
{
    float f = 0.98;
    if (f <= 0.98)
        printf("hi");
    else
        printf("hello");
    getch();
    return 0;
}
I am getting this problem here. On using different floating-point values of f I am getting different results.
Why is this happening?
f is stored with float precision, but the literal 0.98 is a double by default, so the statement f <= 0.98 is evaluated in double precision.
f is therefore converted to a double in the comparison, and the converted value may be slightly larger than 0.98.
Use
if(f <= 0.98f)
or use a double for f instead.
In detail... assuming float is IEEE single-precision and double is IEEE double-precision.
These kinds of floating point numbers are stored with a base-2 representation. In base 2 this number needs infinite precision to represent, because it is a repeating binary fraction:
0.98 = 0.1111101011100001010001111010111000010100011110101110000101000...
A float can only store 24 significant bits, i.e.
0.111110101110000101000111_101...
^ round off here
= 0.111110101110000101001000
= 16441672 / 2^24
= 0.98000001907...
A double can store 53 significant bits, so
0.11111010111000010100011110101110000101000111101011100_00101000...
^ round off here
= 0.11111010111000010100011110101110000101000111101011100
= 8827055269646172 / 2^53
= 0.97999999999999998224...
So the 0.98 will become slightly larger in float and smaller in double.
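A few printf calls make this visible (a small sketch, assuming IEEE 754 float and double):
#include <stdio.h>

int main(void)
{
    float f = 0.98;

    printf("float  0.98f = %.20f\n", 0.98f);   /* 0.98000001907... */
    printf("double 0.98  = %.20f\n", 0.98);    /* 0.97999999999... */
    printf("f <= 0.98  prints %s\n", (f <= 0.98)  ? "hi" : "hello");  /* hello */
    printf("f <= 0.98f prints %s\n", (f <= 0.98f) ? "hi" : "hello");  /* hi    */
    return 0;
}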
It's because floating point values are not exact representations of decimal numbers. All base-ten numbers need to be represented on the computer as base-2 numbers, and it's in this conversion that precision is lost.
Read more about this at http://en.wikipedia.org/wiki/Floating_point
An example (from encountering this problem in my VB6 days)
To convert the number 1.1 to a single precision floating point number we need to convert it to binary. There are 32 bits that need to be created.
Bit 1 is the sign bit (is it negative [1] or position [0])
Bits 2-9 are for the exponent value
Bits 10-32 are for the mantissa (a.k.a. significand, basically the coefficient of scientific notation )
So for 1.1 the single floating point value is stored as follows (this is truncated value, the compiler may round the least significant bit behind the scenes, but all I do is truncate it, which is slightly less accurate but doesn't change the results of this example):
s --exp--- -------mantissa--------
0 01111111 00011001100110011001100
If you notice in the mantissa there is the repeating pattern 0011. 1/10 in binary is like 1/3 in decimal. It goes on forever. So to retrieve the values from the 32-bit single precision floating point value we must first convert the exponent and mantissa to decimal numbers so we can use them.
sign = 0 = a positive number
exponent: 01111111 = 127
mantissa: 00011001100110011001100 = 838860
With the mantissa we need to convert it to a decimal value. The reason is there is an implied integer ahead of the binary number (i.e. 1.00011001100110011001100). The implied number is because the mantissa represents a normalized value to be used in the scientific notation: 1.0001100110011.... * 2^(x-127).
To get the decimal value out of 838860 we simply divide by 2^23, as there are 23 bits in the mantissa. This gives us 0.099999904632568359375. Adding the implied 1 to the mantissa gives us 1.099999904632568359375. The exponent is 127, but the formula calls for 2^(x-127).
So here is the math:
(1 + 0.099999904632568359375) * 2^(127-127)
1.099999904632568359375 * 1 = 1.099999904632568359375
As you can see 1.1 is not really stored in the single floating point value as 1.1.
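If you want to see the actual stored bits rather than the hand-truncated ones, here is a small sketch, assuming IEEE 754 binary32 and a 32-bit unsigned int (note the compiler rounds, so the stored mantissa is 838861 rather than the truncated 838860 used above):
#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void)
{
    float x = 1.1f;
    unsigned int bits;
    memcpy(&bits, &x, sizeof bits);              /* grab the raw bit pattern */

    unsigned int sign     = bits >> 31;
    unsigned int exponent = (bits >> 23) & 0xFF;
    unsigned int mantissa = bits & 0x7FFFFF;

    /* Rebuild the value: (-1)^sign * (1 + mantissa/2^23) * 2^(exponent-127) */
    double value = (sign ? -1.0 : 1.0)
                 * (1.0 + mantissa / 8388608.0)
                 * ldexp(1.0, (int)exponent - 127);

    printf("sign=%u exponent=%u mantissa=%u\n", sign, exponent, mantissa);
    printf("reconstructed value = %.20f\n", value);   /* ~1.10000002384185791016 */
    return 0;
}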

Resources