I want to calculate the square root of an integer without using any floating point arithmetic. The catch, however, is that I don't want to discard precision from the output. That is to say, I do not want a rounded integer as the result; I would like the digits after the decimal point as well, at least to two decimal places. As an example:
sqrt(9) = 3
sqrt(10) = 3.16
sqrt(999999) = 999.99
I've been thinking about it, but I haven't come up with a solution, and searching hasn't helped much, since most similar questions are just that: only similar.
Output is acceptable in any form which is not a floating point number and accurately represents the data. Preferably, I would have two ints, one for the portion before the decimal and one for the portion after the decimal.
I'm okay with just pseudo-code or an explained algorithm, though if you provide code, C would be best. Thanks
You can calculate an integer numerator and an integer denominator, such that the floating-point division of the numerator by the denominator will yield the square root of the input number.
Please note, however, that no square-root method exists such that the result is 100% accurate for every natural number, as the square root of such a number can be irrational.
Here is the algorithm (Newton's method, performed in exact rational arithmetic):
Function (input number, input num_of_iterations, output root):
    Set root.numerator = number
    Set root.denominator = 1
    Repeat num_of_iterations times:
        Set root = root - (root^2 - number) / (2 * root)
You might find this C++ implementation useful (it also includes the conversion of the numerator divided by the denominator into a numerical string with predefined floating-point precision).
Please note that no floating-point operations are required (as demonstrated at the given link).
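Here is a minimal C sketch of the iteration above (my own illustration, not the linked implementation). All arithmetic is on integers; num/den converges to the square root, and the two-integer output format the question asks for falls out at the end:

#include <stdio.h>

/* Greatest common divisor, used to keep the fraction reduced. */
static unsigned long long gcd(unsigned long long a, unsigned long long b) {
    while (b) { unsigned long long t = a % b; a = b; b = t; }
    return a;
}

int main(void) {
    unsigned long long n = 10;           /* the input number */
    unsigned long long num = n, den = 1; /* root = num/den, seeded with n/1 */

    /* root = root - (root^2 - n) / (2*root) = (root^2 + n) / (2*root);
       with root = num/den this becomes: (num^2 + n*den^2) / (2*num*den).
       Note: without big integers the terms overflow after a few
       iterations for large inputs, so treat this as an illustration. */
    for (int i = 0; i < 4; i++) {
        unsigned long long nn = num * num + n * den * den;
        unsigned long long nd = 2 * num * den;
        unsigned long long g = gcd(nn, nd);
        num = nn / g;
        den = nd / g;
    }

    /* Two-integer output: the part before the decimal point and the
       first two digits after it (truncated, not rounded). */
    unsigned long long ipart = num / den;
    unsigned long long frac2 = (num % den) * 100 / den;
    printf("sqrt(%llu) ~= %llu.%02llu\n", n, ipart, frac2);
    return 0;
}

For n = 10 this prints "sqrt(10) ~= 3.16".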
There are many questions (and answers) on this subject, but I am too thick to figure it out. In C, for a floating point of a given type, say double:
double x;
scanf("%lf", &x);
Is there a generic way to calculate an upper bound (as small as possible) for the error between the decimal fraction string passed to scanf and the internal representation of what is now in x?
If I understand correctly, there is sometimes going to be an error, and it will increase as the absolute value of the decimal fraction increases (in other words, 0.1 will be a bit off, but 100000000.1 will be off by much more).
This aspect of the C standard is slightly under-specified, but you can expect the conversion from decimal to double to be within one Unit in the Last Place of the original.
You seem to be looking for a bound on the absolute error of the conversion. With the above assumption, you can compute such a bound as a double as DBL_EPSILON * x. DBL_EPSILON is typically 2^-52.
A tighter bound on the error that can have been made during the conversion can be computed as follows:
double va = fabs(x);
double error = nextafter(va, INFINITY) - va;  /* fabs, nextafter, INFINITY: math.h */
The best conversion functions guarantee conversion to half a ULP in default round-to-nearest mode. If you are using conversion functions with this guarantee, you can divide the bound I offer by two.
The above applies when the original number represented in decimal is 0 or when its absolute value is between DBL_MIN (approx. 2*10^-308) and DBL_MAX (approx. 2*10^308). If the nonzero decimal number's absolute value is lower than DBL_MIN, then the absolute error is instead bounded by DBL_MIN * DBL_EPSILON. If the absolute value is higher than DBL_MAX, you are likely to get infinity as the result of the conversion.
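Putting the pieces together, a minimal complete program might look like this (a sketch; it assumes a conversion accurate to one ULP, as discussed above):

#include <stdio.h>
#include <math.h>   /* fabs, nextafter, INFINITY */

int main(void) {
    double x;
    if (scanf("%lf", &x) != 1) return 1;

    double va = fabs(x);
    double bound = nextafter(va, INFINITY) - va;  /* one ULP of |x| */

    printf("parsed value:         %.17g\n", x);
    printf("max conversion error: %.17g\n", bound);
    return 0;
}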
You can't think of this in terms of base 10; the error is in base 2, which won't necessarily correspond to a specific decimal place in base 10.
There are two underlying issues in your question. First, scanf takes an ASCII string and converts it to a binary number; that is one piece of software which uses a number of C libraries. I have seen, for example, compile-time parsing and runtime parsing give different conversion results on the same system. So in terms of error, if you want an exact number, convert it yourself and place that binary number in the register/variable; otherwise accept what you get with the conversion and understand there may be rounding or clipping on the conversion that you didn't expect (which results in an accuracy issue: you didn't get the number you expected).
The second and real problem, Pascal already answered: you only have a fixed number of binary places. In decimal terms, if you had 3 decimal places, the number 1.2345 would have to be represented as either 1.234 or 1.235. The same goes for binary: if you have 3 fractional bits of mantissa, then 1.0011 is either 1.001 or 1.010, depending on rounding. The mantissa length for IEEE floating point numbers is well documented; you can simply search for how many binary places you have at each precision.
What I thought was a trivial addition in standard C code compiled by GCC has confused me somewhat.
If I have a double called A and a double called B, where A holds a very small value, say 1e-20, and B is larger, for example 1e-5, why does my double C, which equals the summation A + B, take on only the dominant value B? I was hoping that when I specified printing to 25 decimal places I would get 1.00000000000000100000e-5.
Instead what I get is just 1.00000000000000000000e-5. Do I have to use long double or something else?
Very confused, and an easy question for most to answer I'm sure! Thanks for any guidance in advance.
Yes, there is not enough precision in the double mantissa. 2^53 (the precision of the double mantissa) is about 9*10^15, not much larger than 10^15 (the ratio between 10^-5 and 10^-20), so binary expansion and round-off can easily squash the small bits at the end.
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
Google is your friend etc.
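A short program makes that granularity visible; the threshold values here are my own illustration, not from the thread:

#include <stdio.h>

/* A double carries 53 significand bits (~15-16 decimal digits), so an
   addend much more than 2^-53 times smaller than the other operand is
   rounded away entirely. */
int main(void) {
    printf("%d\n", 1.0 + 1e-16 == 1.0);  /* 1: below half an ULP of 1.0 */
    printf("%d\n", 1.0 + 3e-16 == 1.0);  /* 0: above an ULP, it survives */
    printf("%.20e\n", 1e-5 + 1e-20);     /* the question's case */
    return 0;
}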
Floating point variables can hold a much bigger range of values than fixed point, but their precision in significant digits is limited.
You can represent very big or very small numbers, but precision is limited to a fixed number of significant digits.
If you try to perform an operation between numbers whose exponents are very far apart, the hardware first has to represent them with the same exponent.
In your case, when you try to sum the two numbers, the smaller number is shifted to match the exponent of the bigger one, and its significant digits fall off the end, leaving effectively 0.
You can learn more, for example, on Wikipedia.
I don't want to unnecessarily re-invent the wheel, but I have been looking for the functionality of strtod but with a base parameter (2,8,10,16). (I know strtoul allows a base parameter but I'm looking for return type double). Any advice / pointers in the right direction? Thanks.
For arbitrary base, this is a hard problem, but as long as your base is a power of two, the plain naive algorithm will work just fine.
strtod (in C99) supports hex floats in the same format as the C language's hex float constants. 0x prefix is required, p separates the exponent, and the exponent is in base 10 and represents a power of 2. If you need to support pre-C99 libraries, you'll have no such luck. But since you need base 2/4/8 too, it's probably just best to roll your own anyway.
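For example (C99), strtod handles the hex-float format directly:

#include <stdio.h>
#include <stdlib.h>  /* strtod */

int main(void) {
    printf("%g\n", strtod("0x1.8p1", NULL));   /* 3: 1.5 * 2^1 */
    printf("%g\n", strtod("0xA.8p-1", NULL));  /* 5.25: 10.5 * 2^-1 */
    return 0;
}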
Edit: An outline of the naive algorithm (a C sketch follows the list):
Start with a floating point accumulator variable (double or whatever, as you prefer) initialized to 0.
Starting from the leftmost digit, and up to the radix point, for each character you process, multiply the accumulator by the base and add the value of the character as a digit.
After the radix point, start a new running place-value variable, initially 1/base. On each character you process, add the digit value times the place-value variable, and then divide the place-value variable by base.
If you see the exponent character, read the number following it as an integer and use one of the standard library functions (ldexp or scalbn) to scale a floating point number by a power of 2.
If you want to correctly round inputs that have too many digits, you will have to work out that logic once you exceed the number of significant places in step 2 or 3; otherwise you can ignore it.
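Here is the sketch in C, under the assumptions above (parse_based and digit_value are made-up names; error handling and the rounding logic of the last step are omitted):

#include <math.h>    /* ldexp */
#include <stdlib.h>  /* strtol */
#include <stdio.h>

static int digit_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

double parse_based(const char *s, int base) {
    double acc = 0.0;
    int d;

    /* Steps 1-2: integer part, left to right. */
    while ((d = digit_value(*s)) >= 0 && d < base) {
        acc = acc * base + d;
        s++;
    }

    /* Step 3: fractional part with a running place value. For
       power-of-two bases every division here is exact, which is why
       the naive loop is safe. */
    if (*s == '.') {
        double place = 1.0 / base;
        s++;
        while ((d = digit_value(*s)) >= 0 && d < base) {
            acc += d * place;
            place /= base;
            s++;
        }
    }

    /* Step 4: optional binary exponent, scaled with ldexp. */
    if (*s == 'p' || *s == 'P')
        acc = ldexp(acc, (int)strtol(s + 1, NULL, 10));

    return acc;
}

int main(void) {
    printf("%g\n", parse_based("101.1", 2));   /* 5.5 */
    printf("%g\n", parse_based("1.8p1", 16));  /* 3: 1.5 * 2^1 */
    return 0;
}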
Unlikely - I have never seen floating point numbers coded as 'decimals' in other number bases.
I was quite surprised when I tried to multiply a float in C (with GCC 3.2) and it did not do what I expected. As a sample:
#include <stdio.h>

int main() {
    float nb = 3.11f;
    nb *= 10;
    printf("%f\n", nb);
    return 0;
}
Displays: 31.099998
I am curious about the way floats are implemented and why this produces such unexpected behavior.
First off, you can multiply floats. The problem you have is not the multiplication itself, but the original number you used: multiplication can lose some precision, but here the number you multiplied had already lost precision when it was stored.
This is actually expected behavior. Floats are implemented using a binary representation, which means they can't exactly represent most decimal values.
See MSDN for more information.
You can also see in the description of float that it has 6-7 significant digits accuracy. In your example if you round 31.099998 to 7 significant digits you will get 31.1 so it still works as expected here.
The double type would of course be more accurate, but it still has rounding error due to its binary representation, while the number you wrote is decimal.
If you want complete accuracy for decimal numbers, you should use a decimal type. This type exists in languages like C#. http://msdn.microsoft.com/en-us/library/system.decimal.aspx
You can also use a rational-number representation: two integers give you complete accuracy, as long as the number can be expressed as a division of two integers.
This is working as expected. Computers have finite precision: a floating point value is stored in a fixed number of binary digits, and that leads to floating point inaccuracies.
The Floating point wikipedia page goes into far more detail on the representation and resulting accuracy problems than I could here :)
Interesting real-world side-note: this is partly why a lot of money calculations are done using integers (cents) - don't let the computer lose money with lack of precision! I want my $0.00001!
The number 3.11 cannot be represented in binary. The closest you can get with 24 significant bits is 11.0001110000101000111101, which works out to 3.1099998950958251953125 in decimal.
If your number 3.11 is supposed to represent a monetary amount, then you need to use a decimal representation.
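You can see this directly by printing the stored value with more digits than %f shows by default:

#include <stdio.h>

int main(void) {
    float nb = 3.11f;
    printf("%.25f\n", nb);         /* 3.1099998950958251953125000 */
    printf("%.25f\n", nb * 10.0);  /* exact in double: the error was already in nb */
    return 0;
}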
In the Python communities we often see people surprised by this, so there are well-tested-and-debugged FAQs and tutorial sections on the issue. They're phrased in terms of Python, not C, but since Python delegates float arithmetic to the underlying C library and hardware anyway, all the descriptions of float's mechanics still apply.
It's not the multiplication's fault, of course -- remove the statement where you multiply nb and you'll see similar issues anyway.
From the Wikipedia article: "The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers."
Floating point numbers are not precise because they use base 2 (binary: each digit is either 0 or 1) instead of base 10, and converting between base 2 and base 10, as many have stated before, causes rounding and precision issues.
The question I meant to ask concerned the mantissa, not the exponent, and has lots to do with the question I asked earlier in the week regarding "missing" digits on the sum of two negative floats.
Given that the mantissa has a variable precision, how does one tell if one has overflowed the mantissa's current precision setting? Or, from the proactive side, how can one tell if mantissa precision overflow is likely?
Kind regards,
Bruce.
There are a few numerical methods to see if you're going to lose precision, but the bottom line is that you need to understand the definition of precision better.
-4939600281397002.2812
and
-4939600281397002.2812000000000000
are NOT the same number.
When you add
-2234.6016114467412141
and
-4939600281397002.2812
together, the correct output will only have 20 digits of precision, because the additional 12 digits in the smaller number are meaningless given that the 12 similarly sized digits in the larger number are unknown. You can imply that they are zero, but if that's the case then you must explicitly declare them as such, and use a number system that can handle it; the computer is not good at understanding implicit intentions.
As far as detecting when you are going to have this problem, all you need to do is find out if they have the same exponent (assuming a normalized mantissa +/- 1 or similar binary equivalent). If they aren't normalized, then you'll need to normalize them to compare, or use a slightly more complex comparison with the exponent.
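A rough sketch of that exponent comparison in C, using frexp (the threshold is specific to double's 53-bit significand, and the exact boundary depends on rounding and signs):

#include <stdio.h>
#include <math.h>   /* frexp */

/* Returns nonzero when b is so much smaller in magnitude than a that
   adding it to a cannot change any bit of the result. */
int addend_vanishes(double a, double b) {
    int ea, eb;
    frexp(a, &ea);   /* |a| = m * 2^ea with 0.5 <= m < 1 */
    frexp(b, &eb);
    return ea - eb > 54;
}

int main(void) {
    printf("%d\n", addend_vanishes(1.0, 1e-17));  /* 1: 1.0 + 1e-17 == 1.0 */
    printf("%d\n", addend_vanishes(1.0, 1e-10));  /* 0: some bits survive */
    return 0;
}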
Precision and accuracy are not the same thing...
-Adam