C: Adding Exponentials

What I thought was a trivial addition in standard C code compiled by GCC has confused me somewhat.
If I have a double called A and a double called B, where A is a very small value, say 1e-20, and B is a larger value, for example 1e-5, why does my double C, which equals the sum A+B, take on the dominant value B? I was hoping that when I specify to print to 25 decimal places I would get 1.00000000000000100000e-5.
Instead what I get is just 1.00000000000000000000e-5. Do I have to use long double or something else?
Very confused, and an easy question for most to answer I'm sure! Thanks for any guidance in advance.

Yes, there is not enough precision in the double mantissa. 2^53 (the precision of the double mantissa) is only slightly larger than 10^15 (the ratio between 10^-5 and 10^-20), so binary expansion and round-off can easily squash the small bits at the end.
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
Google is your friend etc.

Floating-point variables can hold a much wider range of values than fixed-point ones, but the number of significant digits they carry is limited.
You can represent very large or very small numbers, but the precision always depends on that fixed number of significant digits.
When you operate on two numbers whose exponents are very far apart, the result depends on whether both can be represented with the same exponent.
In your case, when the two numbers are added, the smaller one is shifted to match the exponent of the larger one. Its significant digits fall off the end of the mantissa, so it contributes little or nothing to the sum.
You can learn more, for example, on Wikipedia.
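The effect described in these answers can be checked directly. The following is a minimal sketch, assuming IEEE-754 doubles; `is_absorbed` is a hypothetical helper name, not a library function. Note that with the exact values from the question the small term is still a few units in the last place of B (the spacing of doubles near 1e-5 is about 1.7e-21), so it may survive in rounded form; a term a few orders of magnitude smaller disappears completely.

```c
/* Hypothetical helper: returns 1 when adding `small` to `big`
   leaves `big` unchanged, i.e. `small` was lost entirely. */
int is_absorbed(double big, double small)
{
    /* volatile forces the sum to be rounded to double before comparing */
    volatile double sum = big + small;
    return sum == big;
}
```

Under these assumptions, `is_absorbed(1e-5, 1e-25)` is true and `is_absorbed(1e-5, 1e-20)` is false. The constant `DBL_EPSILON` in `<float.h>` (about 2.2e-16) gives the relative spacing of doubles near 1.0, which is a quick way to estimate where this cutoff lies.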

Related

Calculate square root using integer arithmetic

I want to calculate the square root of some integer without using any floating point arithmetic. The catch, however, is that I don't want to discard precision from the output. That is to say, I do not want a rounded integer as the result, I would like to achieve the decimal point value as well, at least to two significant digits. As an example:
sqrt(9) = 3
sqrt(10) = 3.16
sqrt(999999) = 999.99
I've been thinking about it but I haven't particularly come up with solutions, nor has searching helped much since most similar questions are just that, only similar.
Output is acceptable in any form which is not a floating point number and accurately represents the data. Preferably, I would have two ints, one for the portion before the decimal and one for the portion after the decimal.
I'm okay with just pseudo-code or an explained algorithm; if coding, C would be best. Thanks
You can calculate an integer numerator and an integer denominator, such that the floating-point division of the numerator by the denominator will yield the square root of the input number.
Please note, however, that no square-root method exists such that the result is 100% accurate for every natural number, as the square root of such number can be an irrational number.
Here is the algorithm:
Function (input number, input num_of_iterations, output root):
    Set root.numerator = number
    Set root.denominator = 1
    Run num_of_iterations:
        Set root = root-(root^2-number)/(root*2)
You might find this C++ implementation useful (it also includes the conversion of the numerator divided by the denominator into a numerical string with predefined floating-point precision).
Please note that no floating-point operations are required (as demonstrated at the given link).
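If only two decimal places are needed, a simpler route than the rational-number iteration above is to scale the input by 10000 and take an integer square root. This is a different technique from the one in the answer, sketched here under the assumption that the input fits in 32 bits; `isqrt_u64` and `sqrt_times_100` are hypothetical names.

```c
#include <stdint.h>

/* Largest r such that r*r <= n, using Newton's method on integers only. */
uint64_t isqrt_u64(uint64_t n)
{
    if (n < 2)
        return n;
    uint64_t x = n / 2 + 1;          /* initial guess, never below sqrt(n) */
    uint64_t y = (x + n / x) / 2;
    while (y < x) {                  /* iterate until the guess stops shrinking */
        x = y;
        y = (x + n / x) / 2;
    }
    return x;
}

/* floor(100 * sqrt(n)): the square root truncated to two decimal places. */
uint64_t sqrt_times_100(uint32_t n)
{
    return isqrt_u64((uint64_t)n * 10000u);
}
```

Dividing the result by 100 gives the part before the decimal point and the remainder modulo 100 gives the part after it, which can be printed with a format such as `"%u.%02u"`. The value is truncated, not rounded.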

trouble with double truncation and math in C

I'm making a function that fits balls into boxes. The code that computes the number of balls that can fit on each side of the box is below. Assume that the balls fit together as if they were cubes. I know this is not the optimal way, but just go with it.
The problem for me is that although I get numbers like 4.0000000*4.0000000*2.000000, the product is 31 instead of 32. What's going on?
Two additional things. This error only happens when the optimal side length is reached; for example, the side length is 12.2, the box thickness is .1 and the ball radius is 1.5. This leads to exactly 4 balls fitting on that side. If I DON'T cast to an int, it works out, but if I do cast to an int, I get the aforementioned error (31 instead of 32). Also, the print line runs once if the side length is optimal but twice if it's not. I don't know what that means.
double ballsFit(double r, double l, double w, double h, double boxthick)
{
    double ballsInL, ballsInW, ballsInH;
    int ballsinbox;
    ballsInL= (int)((l-(2*boxthick))/(r*2));
    ballsInW= (int)((w-(2*boxthick))/(r*2));
    ballsInH= (int)((h-(2*boxthick))/(r*2));
    ballsinbox=(ballsInL*ballsInW*ballsInH);
    printf("LENGTH=%f\nWidth=%f\nHight=%f\nBALLS=%d\n", ballsInL, ballsInW, ballsInH, ballsinbox);
    return ballsinbox;
}
The fundamental problem is that floating-point math is inexact.
For example, the number 0.1 -- that you mention as the value of thickness in the problematic example -- cannot be represented exactly as a double. When you assign 0.1 to a variable, what gets stored is an approximation of 0.1.
I recommend that you read What Every Computer Scientist Should Know About Floating-Point Arithmetic.
although I get numbers like 4.0000000*4.0000000*2.000000 the product is 31 instead of 32. whats going on??
It is almost certainly the case that the multiplicands (at least some of them) are not what they look like. If they were exactly 4.0, 4.0 and 2.0, their product would be exactly 32.0. If you printed out all the digits that the doubles are capable of representing, I am pretty sure you'd see lots of 9s, as in 3.99999999999... etc. As a consequence, the product is a tiny bit less than 32. The double-to-int conversion simply chops off the fractional part, so you end up with 31.
Of course, you don't always get numbers that are less than what they would be if the computation were exact; you can also get numbers that are greater than what you might expect.
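One common way to keep truncation from dropping a count that is "almost" a whole number is to add a small tolerance before converting to int. The sketch below assumes positive inputs; `fit_count` and the value of `FIT_TOLERANCE` are illustrative assumptions, and the tolerance should be chosen from how accurately the dimensions are actually known.

```c
/* Assumed tolerance for this sketch; not a universal constant. */
#define FIT_TOLERANCE 1e-9

/* Hypothetical helper: number of whole items of size `item` that fit in
   `space`, treating a quotient within FIT_TOLERANCE below an integer as
   that integer. Both arguments are assumed to be positive. */
int fit_count(double space, double item)
{
    return (int)(space / item + FIT_TOLERANCE);
}
```

With IEEE-754 doubles, `0.3 / 0.1` evaluates to slightly less than 3, so `(int)(0.3 / 0.1)` is 2 while `fit_count(0.3, 0.1)` is 3. Keeping the per-side counts in int variables and multiplying those also avoids a second double-to-int conversion on the product.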
Fixed-precision floating-point numbers, such as the IEEE-754 numbers commonly used in modern computers, cannot represent all decimal numbers accurately, much like 1/3 cannot be represented accurately in decimal.
For example, 0.1 stored as a double is actually 0.1000000000000000055... when converted to binary and back. The difference is small, but significant.
I have occasionally managed to (partially) deal with such issues by using extended or arbitrary precision arithmetic to maintain a degree of precision while computing and then down-converting to double for the final results. There is usually a noticeable drop in performance, but IMHO correctness is infinitely more important.
I recently used algorithms from the high-precision arithmetic libraries listed here with good results on both the precision and performance fronts.

strtod with base parameter

I don't want to unnecessarily re-invent the wheel, but I have been looking for the functionality of strtod but with a base parameter (2,8,10,16). (I know strtoul allows a base parameter but I'm looking for return type double). Any advice / pointers in the right direction? Thanks.
For arbitrary base, this is a hard problem, but as long as your base is a power of two, the plain naive algorithm will work just fine.
strtod (in C99) supports hex floats in the same format as the C language's hex float constants. 0x prefix is required, p separates the exponent, and the exponent is in base 10 and represents a power of 2. If you need to support pre-C99 libraries, you'll have no such luck. But since you need base 2/4/8 too, it's probably just best to roll your own anyway.
Edit: An outline of the naive algorithm:
1. Start with a floating point accumulator variable (double or whatever, as you prefer) initialized to 0.
2. Starting from the leftmost digit, and up to the radix point, for each character you process, multiply the accumulator by the base and add the value of the character as a digit.
3. After the radix point, start a new running place-value variable, initially 1/base. On each character you process, add the digit value times the place-value variable, and then divide the place-value variable by base.
4. If you see the exponent character, read the number following it as an integer and use one of the standard library functions to scale a floating point number by a power of 2.
5. If you want to handle potentially rounding up forms that have too many digits, you have to work out that logic once you exceed the number of significant places in step 2 or 3. Otherwise you can ignore that.
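The accumulator and place-value steps of this outline might look like the sketch below. It is only an illustration: `digit_value` and `parse_real_base` are hypothetical names, the exponent and rounding steps are left out, an ASCII-style contiguous alphabet is assumed, and the result is exact only while the base is a power of two and the digits fit in the mantissa.

```c
#include <stddef.h>

/* Value of c as a digit in `base`, or -1 if c is not a valid digit. */
int digit_value(char c, int base)
{
    int v;
    if (c >= '0' && c <= '9')      v = c - '0';
    else if (c >= 'a' && c <= 'z') v = c - 'a' + 10;
    else if (c >= 'A' && c <= 'Z') v = c - 'A' + 10;
    else return -1;
    return v < base ? v : -1;
}

/* Parses [sign]digits[.digits] in the given base (2..36).
   If `end` is not NULL, it receives the first unparsed character. */
double parse_real_base(const char *s, int base, const char **end)
{
    double acc = 0.0;
    int negative = 0;
    int d;

    if (*s == '+' || *s == '-')
        negative = (*s++ == '-');

    /* Integer part: multiply the accumulator by the base, add the digit. */
    while ((d = digit_value(*s, base)) >= 0) {
        acc = acc * base + d;
        s++;
    }

    /* Fractional part: running place value starts at 1/base. */
    if (*s == '.') {
        double place = 1.0 / base;
        s++;
        while ((d = digit_value(*s, base)) >= 0) {
            acc += d * place;
            place /= base;
            s++;
        }
    }

    if (end)
        *end = s;
    return negative ? -acc : acc;
}
```

For instance, `parse_real_base("101.1", 2, NULL)` yields 5.5 and `parse_real_base("ff.8", 16, NULL)` yields 255.5. An exponent suffix could be handled afterwards with `ldexp` from `<math.h>`.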
Unlikely - I have never seen floating point numbers coded as 'decimals' in other number bases.

How can I introduce a small number with a lot of significant figures into a C program?

I'm not particularly knowledgeable about programming and I'm trying to figure out how to get a precise value calculated in a C program. I need a constant to the power of negative 7, with 5 significant figures. Any suggestions (keeping in mind I know very little, have never programmed in anything but C, and only during required courses that I took years ago at school)?
Thanks!
You can get high-precision math from specialized libraries, but if all you need is 5 significant digits then the built-in float and double types will do fine. Let's go with double for maximum precision.
The negative 7th power is just 1 over your number to the 7th power, so...
double k = 1.2345678; // your constant, whatever it is
double ktominus7 = 1.0 / (k * k * k * k * k * k * k);
...and that's it!
If you want to print out the value, you can do something like
printf("My number is: %9.5g\n", ktominus7);
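The repeated multiplication above can be wrapped in a small function that accepts any integer exponent, including negative ones. This is a sketch only; `power_int` is a hypothetical helper, and the standard `pow` function from `<math.h>` covers the same ground when linking the math library is not a concern.

```c
/* Hypothetical helper: `base` raised to an integer power, which may be
   negative. Uses exponentiation by squaring, so it needs only about
   log2(|exponent|) multiplications. */
double power_int(double base, int exponent)
{
    /* Magnitude of the exponent, computed without overflowing int. */
    unsigned int n = exponent < 0 ? 0u - (unsigned int)exponent
                                  : (unsigned int)exponent;
    double result = 1.0;

    while (n > 0) {
        if (n & 1u)
            result *= base;   /* this bit of the exponent is set */
        base *= base;         /* square for the next bit */
        n >>= 1;
    }
    return exponent < 0 ? 1.0 / result : result;
}
```

With this, `power_int(k, -7)` computes the same quantity as the `ktominus7` expression above, up to rounding in the last digits, which is far below the 5 significant figures required.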
For a constant value, the required calculation is going to be constant too. So, I recommend you calculate the value using your [desktop calculator / MATLAB / other] then hard-code it in your C code.
In the realm of computer floating-point formats, five significant digits is not a lot. The 32-bit IEEE-754 floating-point type used for float in most implementations of C has 24 bits of precision, which is about 7.2 decimal digits. So you can just use floating-point with no fear. double usually has 53 bits of precision (almost 16 decimal digits). Carl Smotricz's answer is fine, but there's also a pow function in C that you can pass -7.0 to.
There are times when you have to be careful about numerical analysis of your algorithm to ensure you aren't losing precision with intermediate results, but this doesn't sound like one of them.
long double offers the best built-in precision in most cases and, like any other variable, can be statically allocated and re-used to keep waste to a minimum. See also quadruple precision. Both vary from platform to platform. In the IEEE 754 quadruple-precision format (binary128), the leftmost bit is the sign, the next 15 bits are the exponent, and the remaining 112 bits hold the significand. If the links provided aren't enough, they all lead back to long double :)
Simple shifting should take care of the rest, if I understand you correctly?
You can use log to transform small numbers into larger ones and do your math on the log-transformed values. It's kind of tricky, but it will work most of the time. You can also switch to Python, which does not have this problem as much.

How to detect mantissa precision overflow in GMP, before or after it happens?

The question I meant to ask concerned the mantissa, not the exponent, and has lots to do with the question I asked earlier in the week regarding "missing" digits on the sum of two negative floats.
Given that the mantissa has a variable precision, how does one tell if one has overflowed the mantissa's current precision setting? Or, from the proactive side, how can one tell if mantissa precision overflow is likely?
Kind regards,
Bruce.
There are a few numerical methods to see if you're going to lose precision, but the bottom line is that you need to understand the definition of precision better.
-4939600281397002.2812
and
-4939600281397002.2812000000000000
are NOT the same number.
When you add
-2234.6016114467412141
and
-4939600281397002.2812
together, the correct output will only have 20 digits of precision, because the additional 12 digits in the smaller number are meaningless given that the 12 similarly sized digits in the larger number are unknown. You can imply that they are zero, but if that's the case then you must explicitly declare them to be as such, and use a numbering system that can handle it - the computer is not good at understanding implicit intentions.
As far as detecting when you are going to have this problem, all you need to do is find out if they have the same exponent (assuming a normalized mantissa +/- 1 or similar binary equivalent). If they aren't normalized, then you'll need to normalize them to compare, or use a slightly more complex comparison with the exponent.
Precision and accuracy are not the same thing...
-Adam
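For detecting lost digits after the fact, the same idea can be illustrated with plain C doubles instead of GMP types. The sketch below uses Knuth's TwoSum, which is a different technique from the exponent comparison described above and says nothing about the GMP API; `addition_error` is a hypothetical name, and IEEE-754 arithmetic without aggressive floating-point optimizations is assumed.

```c
/* Rounding error of the double addition a + b, computed with TwoSum.
   A nonzero result means some low-order digits of the operands did not
   fit into the rounded sum. */
double addition_error(double a, double b)
{
    /* volatile keeps each intermediate rounded to double precision */
    volatile double sum = a + b;
    volatile double b_virtual = sum - a;          /* part of b that made it in */
    volatile double a_virtual = sum - b_virtual;  /* part of a that made it in */
    return (a - a_virtual) + (b - b_virtual);
}
```

For example, `addition_error(1.0, 2.0)` is 0, while `addition_error(1e-5, 1e-25)` returns 1e-25 because the smaller operand was dropped entirely.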

Resources