Calculate the power of a number quickly (MPFR library) - c

I am using the GMP and MPFR libraries to work with large numbers and I need to calculate the power of a number quickly. The result of the exponentiation will always be an integer, but the exponent may or may not be a floating-point number. The GMP library calculates powers very quickly but does not accept floating-point exponents (using the mpz_pow_ui function); the MPFR library accepts floating-point exponents but is extremely slow, as it requires high precision to calculate integers correctly (using the mpfr_pow function).
Is there any solution for this? How can I make GMP accept floating-point exponents, or make MPFR calculate whole numbers quickly (and correctly)?
//Ex:
mpz_pow_ui(mpz_power, base, 4790);            // fast
// power = 4790.60
mpfr_pow(mpfr_power, base, power, MPFR_RNDN); // slow

The mpz_pow_ui function is fast since the computation can be done with multiplications. Note that mpfr_pow_ui with MPFR_RNDN should be faster if you don't need the exact integer result (which will be huge if the exponent is large), as the multiplications can be done with a smaller precision and rounded.
Unless the exponent is a not-too-large integer, mpfr_pow will be slower because it needs to compute a logarithm and an exponential, which are more complex than multiplications. I don't think that you can avoid it, even if you know that the result will be an integer. But if you know that the result will be an integer, you can compute it with an error less than 1/2. Thus, with an error analysis, you may be able to reduce the precision of the input and output variables, so that mpfr_pow will be faster.
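If you do know the result is an integer, a rough sketch of that idea could look like the following. This is my own illustration, not code from the question; the extra 64 bits of working precision are an assumption standing in for a real error analysis.

#include <math.h>
#include <gmp.h>
#include <mpfr.h>

/* Sketch: compute base^exponent with mpfr_pow at a precision slightly above
   the expected bit length of the (integer) result, then round to the nearest
   integer. The 64-bit margin is an assumption, not a derived bound. */
void int_pow_float_exp(mpz_t result, unsigned long base, double exponent)
{
    /* expected bit length of base^exponent is about exponent * log2(base) */
    mpfr_prec_t prec = (mpfr_prec_t)(exponent * log2((double)base)) + 64;

    mpfr_t b, e, p;
    mpfr_inits2(prec, b, e, p, (mpfr_ptr) 0);

    mpfr_set_ui(b, base, MPFR_RNDN);
    mpfr_set_d(e, exponent, MPFR_RNDN);
    mpfr_pow(p, b, e, MPFR_RNDN);

    mpfr_get_z(result, p, MPFR_RNDN);   /* round to the nearest integer */

    mpfr_clears(b, e, p, (mpfr_ptr) 0);
}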
Note: You are saying
the MPFR library accepts floating-point exponents but is extremely slow, as it requires high precision to calculate integers correctly (using the mpfr_pow function)
This is not true. MPFR will carefully choose the intermediate precisions to provide an answer with the required accuracy (though it may sometimes make poor choices).
However, if the exact result is very close to a "machine number", there may be an issue due to the fact that MPFR needs to return the sign of the error (faithful rounding MPFR_RNDF instead of MPFR_RNDN could help, but it is not optimized yet for mpfr_pow): the Table Maker's Dilemma will occur, requiring internal computations in a higher precision. But if you have carefully chosen the precisions of the inputs, this is unlikely to occur in your case; indeed, reducing the precisions of the input will tend to add some error to the exact result, and this will move the exact result away from the expected integer (by exact result, I mean the exact result with the approximate inputs). You can also use some tricks. For instance, if you know that your integer is not a perfect square, then instead of computing x^y, you can compute x^(y/2) (which does not correspond to a machine number, thus should not have an issue due to the Table Maker's Dilemma), then square the result.
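As a small sketch of that x^(y/2) trick (my own illustration; variable names are arbitrary, and x, y are assumed to be already-initialized mpfr_t values):

#include <mpfr.h>

/* Sketch of the "compute x^(y/2), then square" trick described above.
   Dividing y by 2 is exact in binary floating point (barring underflow). */
void pow_via_half_exponent(mpfr_t result, const mpfr_t x, const mpfr_t y)
{
    mpfr_t half_y, tmp;
    mpfr_inits2(mpfr_get_prec(result), half_y, tmp, (mpfr_ptr) 0);

    mpfr_div_ui(half_y, y, 2, MPFR_RNDN);   /* y/2 */
    mpfr_pow(tmp, x, half_y, MPFR_RNDN);    /* x^(y/2) */
    mpfr_sqr(result, tmp, MPFR_RNDN);       /* (x^(y/2))^2 = x^y */

    mpfr_clears(half_y, tmp, (mpfr_ptr) 0);
}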

It all depends on the order of magnitude of the numbers you are calculating with. The following assumes both the base and the exponent are positive non-complex numbers.
First note that exponentiation of an integer b by a decimal number like 4790.60 can be rewritten like this (just can't believe I can't write LaTeX-style math equations on StackOverflow):
b ^ 4790.60 = b ^ (4790 + 0.60) = (b ^ 4790) * (b ^ 0.60)
Then, the first term (b ^ 4790) can clearly be calculated with GMP and results in an integer value.
The second term has an exponent smaller than 1, so its value will be smaller than b. If b is not a huge integer (say << FLINTMAX, the largest integer up to which all integers are exactly representable in a floating-point value; 2^53 for double), then you can use the native double pow function to calculate it, safely round it to an integer as desired, and then multiply by the first term. If b is a huge integer, you have two options: convert it to a double if it is within the double range and then use the double pow function (perhaps losing some precision in the resulting integer in exchange for a speed gain), or use the mpfr_pow function to calculate this second term, noting that it will be a smaller number than b, so you can adjust the MPFR precision in that calculation according to the fractional exponent: if the exponent is close to zero, use a small precision; if it is close to 1, use a precision not much smaller than that of b.
In the end, all of it is a tradeoff between accuracy and speed, given the machine resource limits you are bound to.
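A minimal sketch of this split (illustrative only: the base, the exponent 4790.60, and the 64-bit precision margin are example values, and the base is assumed to fit comfortably in a double so pow can handle the fractional part):

#include <stdio.h>
#include <math.h>
#include <gmp.h>

int main(void)
{
    unsigned long base = 3;          /* example base */
    unsigned long int_exp = 4790;    /* integer part of the exponent */
    double frac_exp = 0.60;          /* fractional part of the exponent */

    mpz_t int_term;
    mpz_init(int_term);
    mpz_ui_pow_ui(int_term, base, int_exp);            /* fast: base^4790 exactly */

    double frac_term = pow((double)base, frac_exp);    /* base^0.60, smaller than base */

    mpf_t result, factor;
    mpf_init2(result, mpz_sizeinbase(int_term, 2) + 64);
    mpf_init_set_d(factor, frac_term);
    mpf_set_z(result, int_term);
    mpf_mul(result, result, factor);                   /* combine the two terms */

    gmp_printf("%.4Ff\n", result);

    mpz_clear(int_term);
    mpf_clears(result, factor, NULL);
    return 0;
}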

Related

Handling Decimals on Embedded C

I have my code below and I want to ask what's the best way of handling numbers (division, multiplication, logarithm, exponents) up to 4 decimal places? I'm using a PIC16F1789 as my device.
float sensorValue = 0.0f;
float sensorAverage;

void main(){
    // Take 100 readings to compute an average
    for(int x = 0; x < 100; x++){
        // Accumulate the sum of all 100 readings
        sensorValue = sensorValue + ADC_GetConversion(SENSOR);
    }
    // Get the average
    sensorAverage = sensorValue / 100.0;
}
In general, on MCUs, floating point types are more costly (clocks, code) to process than integer types. While this is often true even for devices which have a hardware floating point unit, it becomes vital information on devices without one, like the PIC16/18 controllers. These have to emulate all floating point operations in software. This can easily cost >100 clock cycles per addition (much more for multiplication) and bloats the code.
So, it is best to avoid float (not to speak of double) on such systems.
For your example, the ADC returns an integer type anyway, so the summation can be done purely with integer types. You just have to make sure the accumulator does not overflow, so it has to hold roughly 100 times the maximum ADC value for your code.
Finally, to calculate the average, you can either divide the integer by the number of iterations (round to zero), or - better - apply a simple "round to nearest" by:
#define NUMBER_OF_ITERATIONS 100
sensorAverage = (sensorValue + NUMBER_OF_ITERATIONS / 2) / NUMBER_OF_ITERATIONS;
If you really want to speed up your code, set NUMBER_OF_ITERATIONS to a power of two (64 or 128 here), if your code can tolerate this.
Finally: to get not only the integer part of the division, you can treat the sum (sensorValue) as a fractional value. For the given 100 iterations, you can treat it as a decimal fraction: when converting to a string, just print a decimal point left of the lower 2 digits. As you divide by 100, there will be no more than two significant digits of decimal fraction. If you really need 4 digits, e.g. for other operations, you can multiply the sum by 100 (actually, it is 10000, but you have already multiplied it by 100 through the loop).
This is called decimal fixed point. Faster for processing (replaces multiplication by shifts) would be to use binary fixed point, as I stated above.
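A minimal sketch of that decimal fixed-point idea (the values are made up, and printf is only used here to show the formatting; on a PIC you would format into a buffer or a display instead):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t sum = 51234;                 /* e.g. the sum of 100 ADC readings */
    uint32_t integer_part  = sum / 100;   /* average, rounded toward zero */
    uint32_t fraction_part = sum % 100;   /* the two "decimal" digits */

    /* implied decimal point two digits from the right: prints 512.34 */
    printf("average = %lu.%02lu\n",
           (unsigned long)integer_part, (unsigned long)fraction_part);
    return 0;
}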
On PIC16, I would strongly suggest thinking about using binary fractions, as multiplication and division are very costly on this platform. In general, these controllers are not well suited for signal processing. If you need to sustain some performance, an ARM Cortex-M0 or M4 would be the far better choice - at similar prices.
In your example it is trivial to avoid non-integer representations altogether; however, to answer your question more generally, an ISO-compliant compiler will support floating-point arithmetic and the maths library, but for performance and code-size reasons you may want to avoid that.
Fixed-point arithmetic is what you probably need. For simple calculations an ad-hoc approach to fixed point can be used, whereby for example you treat the units of sensorAverage as hundredths (1/100) and avoid the expensive division altogether. However, if you want to perform full maths library operations, then a better approach is to use a fixed-point library. One such library is presented in Optimizing Applications with Fixed-Point Arithmetic by Anthony Williams. The code is C++ and PIC16 may lack a decent C++ compiler, but the methods can be ported somewhat less elegantly to C. It also uses a huge 64-bit 36Q28 fixed-point format, which would be expensive and slow on PIC16; you might want to adapt it to use 16Q16 perhaps.
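For illustration, an ad-hoc 16Q16 sketch of the binary fixed-point idea (this is not the Anthony Williams library, just a minimal hand-rolled version; a 64-bit intermediate is assumed for the multiplication):

#include <stdint.h>
#include <stdio.h>

typedef int32_t q16_16;               /* 16 integer bits, 16 fractional bits */
#define Q16_ONE (1L << 16)

static q16_16 q16_from_double(double d) { return (q16_16)(d * Q16_ONE); }
static double q16_to_double(q16_16 q)   { return (double)q / Q16_ONE; }

static q16_16 q16_mul(q16_16 a, q16_16 b)
{
    /* widen to 64 bits, then drop 16 fractional bits (arithmetic shift assumed) */
    return (q16_16)(((int64_t)a * b) >> 16);
}

int main(void)
{
    q16_16 x = q16_from_double(3.25);
    q16_16 y = q16_from_double(0.5);
    printf("3.25 * 0.5 = %f\n", q16_to_double(q16_mul(x, y)));
    return 0;
}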
If you are really concerned about performance, stick to integer arithmetic and try to make the number of samples to average a power of two, so the division can be done by means of bit shifts. If it is not a power of two, let's say 100 (as Olaf points out for fixed point), you can still use bit shifts and additions: How can I multiply and divide using only bit shifting and adding?
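A small sketch of the power-of-two variant (ADC_GetConversion and SENSOR are the platform call and macro from the question; the assumption that the reading fits in 16 bits is mine):

#include <stdint.h>

#define N_SAMPLES 128          /* power of two */
#define N_SHIFT   7            /* log2(N_SAMPLES) */

uint16_t read_average(void)
{
    uint32_t sum = 0;
    for (uint8_t i = 0; i < N_SAMPLES; i++)
        sum += ADC_GetConversion(SENSOR);                  /* integer ADC reading */
    return (uint16_t)((sum + N_SAMPLES / 2) >> N_SHIFT);   /* rounded divide by 128 */
}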
If you are not concerned about performance and still want to work with floats (you have already been warned this may not be very fast on a PIC16 and may use a lot of flash), math.h has the following functions: http://en.cppreference.com/w/c/numeric/math including exponentiation, pow(base, exp), and logarithms (only base 2, base 10 and base e; for an arbitrary base, use the change-of-base logarithmic property, as sketched below).
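For example, the change-of-base property log_b(x) = log(x) / log(b) with the standard math.h functions (the values are arbitrary examples):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1000.0, b = 5.0;                 /* arbitrary example values */
    double log_b_x = log(x) / log(b);           /* log base b of x */
    printf("log base %g of %g = %f\n", b, x, log_b_x);
    printf("check: pow(%g, %f) = %f\n", b, log_b_x, pow(b, log_b_x));
    return 0;
}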

Float precision

Due to the limited precision of the microcontroller, I defined a symbol containing the ratio of two floating-point numbers, instead of writing the result directly.
#define INTERVAL (0.01F/0.499F)
instead of
#define INTERVAL 0.02004008016032064F
But the first solution adds another operation, "/". In terms of optimization and correctness of the result, which is the better solution?
They are the same; your compiler will evaluate 0.01F/0.499F at compile time.
There is a mistake in your constant value 0.01F/0.499F = 0.02004008016032064F.
0.01F/0.499F is evaluated at compile time. The precision used at compile time depends on the compiler and likely exceeds the micro-controller's. Thus either approach will typically provide the same code.
In the unlikely case that the compiler's precision is about the same as the micro-controller's float and typical binary floating-point, the values 0.01F and 0.499F will not be exact but each within 0.5 ULP (unit in the last place). The quotient 0.01F/0.499F will then be within about sqrt(2)*0.5 ULP. Using 0.02004008016032064F will be within 0.5 ULP. So in select situations, the constant will be better than the quotient.
Under rarer circumstances, the float precision will exceed what the digits of 0.02004008016032064F capture, and the quotient would be better.
In the end, I recommend coding to whatever values are used to drive the equation, e.g. if 0.01 and 0.499 are the values of two resistors, use those 2 values.
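For instance (the resistor names below are just this answer's example, not anything from the question):

#define R_SENSE 0.01F
#define R_REF   0.499F
#define INTERVAL (R_SENSE / R_REF)   /* the ratio is folded at compile time */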

How unreliable are floating point values, operators and functions?

I don't want to introduce floating point when an inexact value would be a disaster, so I have a couple of questions about when you actually can use them safely.
Are they exact for integers as long as you don't overflow the number of significant digits? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
I guess you can summarize this question as: 1) Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h? 2) Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
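A quick demonstration of that 2^53 boundary:

#include <stdio.h>

int main(void)
{
    double a = 9007199254740992.0;   /* 2^53, exactly representable */
    double b = a + 1.0;              /* 2^53 + 1 rounds back to 2^53 */
    printf("a == b: %d\n", a == b);  /* prints 1 on IEEE 754 doubles */
    return 0;
}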
Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
They at least guarantee that the result is immediately on either side of the actual mathematical result. That is, you won't get a result which has a valid floating point value between itself and the "actual" result. But beware, because repeated operations may accumulate an error which seems counter to this, while it is not (the constraint applies to every intermediate value individually, not to the inputs and output of a compound expression as a whole).
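A classic illustration of such accumulation (each addition is correctly rounded, yet the final sum is not the double closest to the exact decimal result):

#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;                               /* 0.1 itself is not exact in binary */
    printf("sum == 1.0: %d\n", sum == 1.0);       /* prints 0 on IEEE 754 doubles */
    return 0;
}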

upper bound for the floating point error for a number

There are many questions (and answers) on this subject, but I am too thick to figure it out. In C, for a floating point of a given type, say double:
double x;
scanf("%lf", &x);
Is there a generic way to calculate an upper bound (as small as possible) for the error between the decimal fraction string passed to scanf and the internal representation of what is now in x?
If I understand correctly, there is sometimes going to be an error, and it will increase as the absolute value of the decimal fraction increases (in other words, 0.1 will be a bit off, but 100000000.1 will be off by much more).
This aspect of the C standard is slightly under-specified, but you can expect the conversion from decimal to double to be within one Unit in the Last Place of the original.
You seem to be looking for a bound on the absolute error of the conversion. With the above assumption, you can compute such a bound as a double as DBL_EPSILON * x. DBL_EPSILON is typically 2^-52.
A tighter bound on the error that can have been made during the conversion can be computed as follows:
double va = fabs(x);
double error = nextafter(va, INFINITY) - va;   /* next double above va, minus va */
The best conversion functions guarantee conversion to half a ULP in default round-to-nearest mode. If you are using conversion functions with this guarantee, you can divide the bound I offer by two.
The above applies when the original number represented in decimal is 0 or when its absolute value is between DBL_MIN (approx. 2*10^-308) and DBL_MAX (approx. 2*10^308). If the non-zero decimal number's absolute value is lower than DBL_MIN, then the absolute error is only bounded by DBL_MIN * DBL_EPSILON. If the absolute value is higher than DBL_MAX, you are likely to get infinity as the result of the conversion.
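Putting the coarse and the tighter bound from this answer together in one small program (illustrative only):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>

int main(void)
{
    double x;
    if (scanf("%lf", &x) != 1) return EXIT_FAILURE;

    double va = fabs(x);
    double coarse = DBL_EPSILON * va;              /* simple bound discussed above */
    double tight  = nextafter(va, INFINITY) - va;  /* one ULP of x */

    printf("x = %.17g\ncoarse bound = %g\ntight bound = %g\n", x, coarse, tight);
    return 0;
}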
You can't think of this in terms of base 10; the error is in base 2, which won't necessarily correspond to a specific decimal place in base 10.
You have two underlying issues with your question. First, scanf takes an ASCII string and converts it to a binary number; that is one piece of software which uses a number of C libraries. I have seen, for example, compile-time parsing vs runtime parsing give different conversion results on the same system. So in terms of error, if you want an exact number, convert it yourself and place that binary number in the register/variable; otherwise accept what you get with the conversion and understand there may be rounding or clipping in the conversion that you didn't expect (which results in an accuracy issue: you didn't get the number you expected).
The second and real problem Pascal already answered: you only have a certain number of binary places. In terms of decimal, if you had 3 decimal places, the number 1.2345 would have to be represented as either 1.234 or 1.235. The same goes for binary: if you have 3 bits of mantissa, then 1.0011 is either 1.001 or 1.010 depending on rounding. The mantissa length for IEEE floating-point numbers is well documented; you can simply google how many binary places you have for each precision.

C: Adding Exponentials

What I thought was a trivial addition in standard C code compiled by GCC has confused me somewhat.
If I have a double called A and a double called B, where A is a very small value, say 1e-20, and B is a larger value, for example 1e-5, why does my double C, which equals the sum A+B, take on the dominant value B? I was hoping that when I specify printing to 25 decimal places I would get 1.00000000000000100000e-5.
Instead what I get is just 1.00000000000000000000e-5. Do I have to use long double or something else?
Very confused, and an easy question for most to answer I'm sure! Thanks for any guidance in advance.
Yes, there is not enough precision in the double mantissa. 2^53 (the precision of the double mantissa) is only slightly larger than 10^15 (the ratio between 1e-5 and 1e-20), so binary expansion and round-off can easily squash small bits at the end.
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
Google is your friend etc.
Floating point variables can hold a bigger range of values than fixed point, but their precision in significant digits has limits.
You can represent very big or very small numbers, but the precision depends on the number of significant digits.
If you try to perform an operation between numbers whose exponents are very far apart, the ability to work with them depends on being able to represent them with the same exponent.
In your case, when you try to sum the two numbers, the smaller number is shifted to match the exponent of the bigger one; its significant digits fall out of range, so it effectively contributes nothing.
You can learn more for example on wiki
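A tiny program showing this absorption; the values (1.0 and 1e-20) are mine, chosen so the gap unambiguously exceeds the 53-bit significand:

#include <stdio.h>

int main(void)
{
    double big = 1.0, small = 1e-20;
    double sum = big + small;
    printf("sum == big: %d\n", sum == big);   /* prints 1: small was absorbed */
    printf("%.25e\n", sum);                   /* 1.0000000000000000000000000e+00 */
    return 0;
}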
