Floating point operations with no library - c

I am looking for an efficient way to properly do mathematical operations with floating-point values. As I am working in embedded C, I don't want to use any extra library for the float data type.
As far as I understand, the correct way here would be to treat a floating value as raw binary (sign, exponent, mantissa), and do the operations like that. But I cannot find any examples of how exactly that works.
I am looking for an explanation of how to do the following with no float data type:
Given a variable int x that can have values from 0 to 10000.
y = x * 0.720 + 84.234;
y = y / 2.5;
Thank you for your time internet

Floating point libraries are not required for the example operations you have suggested, and while avoiding floating point code on an embedded system without an FPU is often advisable, doing that by implementing your own floating point encoding will save you nothing and will likely be less efficient, less comprehensible and more error prone than using the compiler's built-in FP support.
Instead, you need to avoid floating-point code entirely and use fixed-point encoding. In many cases that can be done ad hoc for individual expressions, but if your application is math intensive (involving trig, logs, sqrt or exponentiation, for example) you might want to choose a fixed-point library or implement your own.
Floating-point dependency is trivially eradicated in the examples you have suggested; for example:
// y = x * 0.720 + 84.234
// Where x_x1000 = real value * 1000
int y_x1000 = (x_x1000 * 720) / 1000 + 84234 ;
or more efficiently using binary-fixed-point and a 10 bit fractional part:
// y = x * 0.720 + 84.234
// Where x_q10 = real value * 1024
int32_t y_q10 = ((x_q10 * 737) >> 10) + 86256 ;
Although you might consider int64_t for greater numeric range - in which case you might also use more fractional bits for greater precision too.
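The second expression can be handled the same way; as a minimal sketch reusing the Q10 value from above (y2_q10 is just an illustrative name), note that dividing by 2.5 is the same as multiplying by 2 and dividing by 5, which stays entirely in integer arithmetic:
// y = y / 2.5
// Multiply by 2 and divide by 5; the Q10 scaling is unchanged.
int32_t y2_q10 = (y_q10 * 2) / 5 ;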
If you are doing a lot of intensive fixed-point maths, you would do well to consider a library or implement one using CORDIC algorithms. An example of such a library can be found at https://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html, although it is C++ - the clear advantage being that by defining a fixed class and extensive operator overloading, existing floating-point code can largely be converted to fixed point by replacing double or float keywords with fixed and compiling as C++ - even if the code is otherwise non-OOP and entirely C-like.

Related

How to round 8.475 to 8.48 in C (rounding function that takes into account representation issues)? Reducing probability of issue

I am trying to round 8.475 to 8.48 (to two decimal places in C). The problem is that 8.475 internally is represented as 8.47499999999999964473:
double input_test =8.475;
printf("input tests: %.20f, %.20f \n", input_test, *&input_test);
gives:
input tests: 8.47499999999999964473, 8.47499999999999964473
So, if I had an ideal round function then it would round 8.475 = 8.4749999... to 8.47. So, the internal round function is not appropriate for me. I see that the rounding problem arises in cases of "underflow" and therefore I am trying to use the following algorithm:
double MyRound2( double *value ) {
    double ad;
    long long mzr;
    double resval;
    if ( *value < 0.000000001 )
        ad = -0.501;
    else
        ad = 0.501;
    mzr = (long long)(*value);          /* integer part */
    resval = *value - mzr;              /* fractional part */
    resval = ((long long)(resval * 100 + ad)) / 100.0;
    return mzr + resval;
}
This solves the "underflow" issue and it works well for "overflow" issues as well. The problem is that there are valid values x.xxx99 for which this function incorrectly gives a bigger value (because of the 0.001 in 0.501). How can I solve this issue; how can I devise an algorithm that detects floating point representation issues and rounds taking them into account? Maybe C already has such a clever rounding function? Maybe I can select a different value for the constant ad, such that the probability of such rounding errors goes to zero (I mostly work with money values with up to 4 decimal digits).
I have read all the popular articles about floating point representation and I know that there are tricky and unsolvable issues, but my client does not accept such an explanation, because the client can clearly demonstrate that Excel handles (reproduces, rounds and so on) floating point numbers without representation issues.
(The C and C++ standards are intentionally flexible when it comes to the specification of the double type; quite often it is an IEEE754 64-bit type, so your observed result is platform-dependent.)
You are observing one of the pitfalls of using floating point types.
Sadly there isn't an "out-of-the-box" fix for this. (Adding a small constant pre-rounding just pushes the problem to other numbers).
Moral of the story: don't use floating point types for money.
Use a special currency type instead, or work in "pence" using an integral type.
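As a minimal sketch of the integral-pence approach (the names and the 20% VAT figure are purely illustrative):
#include <stdio.h>
#include <stdint.h>

typedef int64_t money_t;   /* amount in pence */

int main(void)
{
    money_t price = 847;                      /* 8.47 */
    money_t vat = (price * 20 + 50) / 100;    /* 20%, rounded half up */
    printf("price %lld.%02lld, vat %lld.%02lld\n",
           (long long)(price / 100), (long long)(price % 100),
           (long long)(vat / 100), (long long)(vat % 100));
    return 0;
}
Every intermediate value is an exact integer, so there is nothing to mis-round.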
By the way, Excel does use an IEEE754 double precision floating point for its number type, but it also has some clever tricks up its sleeve. Essentially it tracks the junk digits carefully and is also clever with its formatting. This is how it can evaluate 1/3 + 1/3 + 1/3 exactly. But even it will get money calculations wrong sometimes.
For financial calculations, it is better to work in base-10 to avoid representation issues when converting to/from binary. In many countries, financial software is even legally required to do so. Here is one library for IEEE 754R Decimal Floating-Point Arithmetic; I have not tried it myself:
http://www.netlib.org/misc/intel/
Also note that working in decimal floating-point instead of a fixed-point representation allows clever algorithms like the Kahan summation algorithm to avoid accumulation of rounding errors. A noteworthy difference from normal floating point is that numbers with few significant digits are not normalized, so you can have e.g. both 1*10^2 and .1*10^3.
An implementation note is that one representation in the standard uses a binary significand, to allow software implementations using a standard binary ALU.
How about this one: define some threshold, namely the distance to the next multiple of 0.005 below which you assume the gap could be due to imprecision. If the value is within that distance and just below the multiple, round as usual and then add 0.01 at the end.
That said, this is only a workaround and somewhat of a code smell. If you don't need too much speed, go for some other type than float, like your own type that works like
class myDecimal { int digits; int exponent_of_ten; } with value = digits * 10^exponent_of_ten
I am not trying to argue that using floating point numbers to represent money is advisable - it is not! But sometimes you have no choice... We do kind of work with money (life insurance calculations) and are forced to use floating point numbers for everything, including values representing money.
Now there are quite some different rounding behaviours out there: round up, round down, round half up, round half down, round half even, maybe more. It looks like you were after round half up method.
Our round-half-up function - here translated from Java - looks like this:
#include <iostream>
#include <cmath>
#include <cfloat>

using namespace std;

int main()
{
    double value = 8.47499999999999964473;
    double result = value * pow(10, 2);
    result = nextafter(result + (result > 0.0 ? 1e-8 : -1e-8), DBL_MAX);
    double integral = floor(result);
    double fraction = result - integral;
    if (fraction >= 0.5) {
        result = ceil(result);
    } else {
        result = integral;
    }
    result /= pow(10, 2);
    cout << result << endl;
    return 0;
}
where nextafter is a function returning the next floating point value after the given value. This code has been verified with C++11 (AFAIK nextafter is also available in Boost); the result written to standard output is 8.48.

Trying to compute sine in C using the Taylor series, but I can't get it to print out [duplicate]

I've been poring through .NET disassemblies and the GCC source code, but can't seem to find anywhere the actual implementation of sin() and other math functions... they always seem to be referencing something else.
Can anyone help me find them? I feel like it's unlikely that ALL hardware that C will run on supports trig functions in hardware, so there must be a software algorithm somewhere, right?
I'm aware of several ways that functions can be calculated, and have written my own routines to compute functions using Taylor series for fun. I'm curious about how real, production languages do it, since all of my implementations are always several orders of magnitude slower, even though I think my algorithms are pretty clever (obviously they're not).
In GNU libm, the implementation of sin is system-dependent. Therefore you can find the implementation, for each platform, somewhere in the appropriate subdirectory of sysdeps.
One directory includes an implementation in C, contributed by IBM. Since October 2011, this is the code that actually runs when you call sin() on a typical x86-64 Linux system. It is apparently faster than the fsin assembly instruction. Source code: sysdeps/ieee754/dbl-64/s_sin.c, look for __sin (double x).
This code is very complex. No one software algorithm is as fast as possible and also accurate over the whole range of x values, so the library implements several different algorithms, and its first job is to look at x and decide which algorithm to use.
When x is very very close to 0, sin(x) == x is the right answer.
A bit further out, sin(x) uses the familiar Taylor series. However, this is only accurate near 0, so...
When the angle is more than about 7°, a different algorithm is used, computing Taylor-series approximations for both sin(x) and cos(x), then using values from a precomputed table to refine the approximation.
When |x| > 2, none of the above algorithms would work, so the code starts by computing some value closer to 0 that can be fed to sin or cos instead.
There's yet another branch to deal with x being a NaN or infinity.
This code uses some numerical hacks I've never seen before, though for all I know they might be well-known among floating-point experts. Sometimes a few lines of code would take several paragraphs to explain. For example, these two lines
double t = (x * hpinv + toint);
double xn = t - toint;
are used (sometimes) in reducing x to a value close to 0 that differs from x by a multiple of π/2, specifically xn × π/2. The way this is done without division or branching is rather clever. But there's no comment at all!
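To see what those two lines do, here is a small, hedged demonstration of the trick (the constants are my own choices to illustrate the idea: toint is 1.5 * 2^52 and hpinv is 2/π; it relies on IEEE-754 doubles and round-to-nearest mode):
#include <stdio.h>

int main(void)
{
    const double pi = 3.141592653589793;
    const double hpinv = 2.0 / pi;              /* 2/pi */
    const double toint = 6755399441055744.0;    /* 1.5 * 2^52 */
    double x = 10.0;

    double t  = x * hpinv + toint;  /* adding toint forces rounding to an integer in the low bits */
    double xn = t - toint;          /* recover that integer (here 6) as a double: no branch, no rint() */
    double reduced = x - xn * (pi / 2.0);

    printf("x = %g is %g quarter-turns plus %g radians\n", x, xn, reduced);
    return 0;
}
The real glibc code uses a multi-part representation of π/2 in the final subtraction to keep the reduction accurate, which this sketch omits.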
Older 32-bit versions of GCC/glibc used the fsin instruction, which is surprisingly inaccurate for some inputs. There's a fascinating blog post illustrating this with just 2 lines of code.
fdlibm's implementation of sin in pure C is much simpler than glibc's and is nicely commented. Source code: fdlibm/s_sin.c and fdlibm/k_sin.c
Functions like sine and cosine are implemented in microcode inside microprocessors. Intel chips, for example, have assembly instructions for these. A C compiler will generate code that calls these assembly instructions. (By contrast, a Java compiler will not. Java evaluates trig functions in software rather than hardware, and so it runs much slower.)
Chips do not use Taylor series to compute trig functions, at least not entirely. First of all they use CORDIC, but they may also use a short Taylor series to polish up the result of CORDIC or for special cases such as computing sine with high relative accuracy for very small angles. For more explanation, see this StackOverflow answer.
OK kiddies, time for the pros....
This is one of my biggest complaints with inexperienced software engineers. They come in calculating transcendental functions from scratch (using Taylor's series) as if nobody had ever done these calculations before in their lives. Not true. This is a well defined problem and has been approached thousands of times by very clever software and hardware engineers and has a well defined solution.
Basically, most of the transcendental functions use Chebyshev Polynomials to calculate them. Which polynomials are used depends on the circumstances. First, the bible on this matter is a book called "Computer Approximations" by Hart and Cheney. In that book, you can decide if you have a hardware adder, multiplier, divider, etc., and decide which operations are fastest. E.g. if you had a really fast divider, the fastest way to calculate sine might be P1(x)/P2(x) where P1, P2 are Chebyshev polynomials. Without the fast divider, it might be just P(x), where P has many more terms than P1 or P2... so it'd be slower. So, the first step is to determine your hardware and what it can do.
Then you choose the appropriate combination of Chebyshev polynomials (usually of the form cos(ax) = aP(x) for cosine, for example, again where P is a Chebyshev polynomial). Then you decide what decimal precision you want. E.g. if you want 7 digits of precision, you look that up in the appropriate table in the book I mentioned, and it will give you (for precision = 7.33) a number N = 4 and a polynomial number 3502. N is the order of the polynomial (so it's p4.x^4 + p3.x^3 + p2.x^2 + p1.x + p0), because N = 4. Then you look up the actual values of p4, p3, p2, p1, p0 in the back of the book under 3502 (they'll be in floating point). Then you implement your algorithm in software in the form:
(((p4.x + p3).x + p2).x + p1).x + p0
....and this is how you'd calculate cosine to 7 decimal places on that hardware.
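In code, that nested form is just Horner's rule; a generic sketch (the function name is illustrative, and the coefficient values would come from the tables in Hart & Cheney):
/* Evaluate p[0] + p[1]*x + ... + p[n]*x^n by Horner's rule. */
double poly_eval(const double *p, int n, double x)
{
    double r = p[n];
    for (int i = n - 1; i >= 0; --i)
        r = r * x + p[i];
    return r;
}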
Note that most hardware implementations of transcendental operations in an FPU usually involve some microcode and operations like this (depends on the hardware).
Chebyshev polynomials are used for most transcendentals, but not all. E.g. for square root, it is faster to use a double iteration of the Newton-Raphson method, seeded from a lookup table.
Again, that book "Computer Approximations" will tell you that.
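To illustrate that square-root remark, here is a rough Newton-Raphson sketch; a real implementation would seed the initial guess from a small lookup table so that only a couple of iterations are needed, whereas this crude guess needs more (and only behaves sensibly for moderately sized a):
double nr_sqrt(double a)
{
    double x = a > 1.0 ? a : 1.0;     /* crude seed; a table lookup would be far better */
    for (int i = 0; i < 30; ++i)
        x = 0.5 * (x + a / x);        /* Newton-Raphson step for x*x = a */
    return x;
}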
If you plan on implementing these functions, I'd recommend to anyone that they get a copy of that book. It really is the bible for these kinds of algorithms.
Note that there are bunches of alternative means for calculating these values, like CORDIC, etc., but these tend to be best for specific algorithms where you only need low precision. To guarantee the precision every time, the Chebyshev polynomials are the way to go. Like I said: well defined problem, solved for 50 years now... and that's how it's done.
Now, that being said, there are techniques whereby the Chebyshev polynomials can be used to get a single precision result with a low degree polynomial (like the example for cosine above). Then, there are other techniques to interpolate between values to increase the accuracy without having to go to a much larger polynomial, such as "Gal's Accurate Tables Method". This latter technique is what the post that refers to the ACM literature is describing. But ultimately, the Chebyshev Polynomials are what are used to get 90% of the way there.
Enjoy.
For sin specifically, using Taylor expansion would give you:
sin(x) := x - x^3/3! + x^5/5! - x^7/7! + ... (1)
You would keep adding terms until either the difference between them is lower than an accepted tolerance level or just for a fixed number of steps (faster, but less precise). An example would be something like:
float sin(float x)
{
    float res = 0, pow = x, fact = 1;
    for (int i = 0; i < 5; ++i)
    {
        res += pow / fact;
        pow *= -1 * x * x;
        fact *= (2 * (i + 1)) * (2 * (i + 1) + 1);
    }
    return res;
}
Note: (1) works because of the approximation sin(x) ≈ x for small angles. For bigger angles you need to calculate more and more terms to get acceptable results.
You can use a while loop instead and continue until a given accuracy is reached:
double sin (double x){
    int i = 1;
    double cur = x;
    double acc = 1;
    double fact = 1;
    double pow = x;
    while (fabs(acc) > .00000001 && i < 100){
        fact *= ((2*i)*(2*i+1));
        pow *= -1 * x*x;
        acc = pow / fact;
        cur += acc;
        i++;
    }
    return cur;
}
Concerning trigonometric functions like sin(), cos(), tan(): there has been no mention, after 5 years, of an important aspect of high quality trig functions: range reduction.
An early step in any of these functions is to reduce the angle, in radians, to a 2*π interval. But π is irrational, so simple reductions like x = remainder(x, 2*M_PI) introduce error, as M_PI, machine pi, is only an approximation of π. So, how to do x = remainder(x, 2*π)?
Early libraries used extended precision or crafted programming to give quality results but still over a limited range of double. When a large value was requested like sin(pow(2,30)), the results were meaningless or 0.0 and maybe with an error flag set to something like TLOSS total loss of precision or PLOSS partial loss of precision.
Good range reduction of large values to an interval like -π to π is a challenging problem that rivals the challenges of the basic trig function, like sin(), itself.
A good report is Argument reduction for huge arguments: Good to the last bit (1992). It covers the issue well: it discusses the need and how things were on various platforms (SPARC, PC, HP, 30+ others) and provides a solution algorithm that gives quality results for all double from -DBL_MAX to DBL_MAX.
If the original arguments are in degrees, yet may be of a large value, use fmod() first for improved precision. A good fmod() will introduce no error and so provide excellent range reduction.
// sin(degrees2radians(x))
sin(degrees2radians(fmod(x, 360.0))); // -360.0 < fmod(x,360) < +360.0
Various trig identities and remquo() offer even more improvement. Sample: sind()
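A hedged sketch of what such a sind() might look like (the name and structure are illustrative; fmod keeps the degree reduction exact and remquo's quotient selects the quadrant):
#include <math.h>

double sind(double degrees)
{
    int quo;
    double r = remquo(fmod(degrees, 360.0), 90.0, &quo); /* r in [-45, 45] degrees */
    double rr = r * 3.141592653589793 / 180.0;           /* small, safe radian conversion */
    switch ((unsigned)quo % 4) {                          /* quadrant folding */
        case 0:  return  sin(rr);
        case 1:  return  cos(rr);
        case 2:  return -sin(rr);
        default: return -cos(rr);
    }
}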
Yes, there are software algorithms for calculating sin too. Basically, calculating this kind of thing on a digital computer is usually done using numerical methods, such as approximating the Taylor series that represents the function.
Numerical methods can approximate functions to an arbitrary amount of accuracy and since the amount of accuracy you have in a floating number is finite, they suit these tasks pretty well.
Use the Taylor series and try to find a relation between terms of the series so you don't calculate things again and again.
Here is an example for cosine:
#include <math.h>   /* for fabs */

double cosinus(double x, double prec)
{
    double t, s;
    int p;
    p = 0;
    s = 1.0;
    t = 1.0;
    while (fabs(t/s) > prec)
    {
        p++;
        t = (-t * x * x) / ((2 * p - 1) * (2 * p));
        s += t;
    }
    return s;
}
Using this we can get the new term of the sum from the one already used (we avoid the factorial and x^(2p)).
It is a complex question. Intel-like CPUs of the x86 family have a hardware implementation of the sin() function, but it is part of the x87 FPU and not used anymore in 64-bit mode (where SSE2 registers are used instead). In that mode, a software implementation is used.
There are several such implementations out there. One is in fdlibm and is used in Java. As far as I know, the glibc implementation contains parts of fdlibm, and other parts contributed by IBM.
Software implementations of transcendental functions such as sin() typically use approximations by polynomials, often obtained from Taylor series.
Chebyshev polynomials, as mentioned in another answer, are the polynomials where the largest difference between the function and the polynomial is as small as possible. That is an excellent start.
In some cases, the maximum error is not what you are interested in, but the maximum relative error. For example for the sine function, the error near x = 0 should be much smaller than for larger values; you want a small relative error. So you would calculate the Chebyshev polynomial for sin x / x, and multiply that polynomial by x.
Next you have to figure out how to evaluate the polynomial. You want to evaluate it in such a way that the intermediate values are small and therefore rounding errors are small. Otherwise the rounding errors might become a lot larger than errors in the polynomial. And with functions like the sine function, if you are careless then it may be possible that the result that you calculate for sin x is greater than the result for sin y even when x < y. So careful choice of the calculation order and calculation of upper bounds for the rounding error are needed.
For example, sin x = x - x^3/6 + x^5 / 120 - x^7 / 5040... If you calculate naively sin x = x * (1 - x^2/6 + x^4/120 - x^6/5040...), then that function in parentheses is decreasing, and it will happen that if y is the next larger number to x, then sometimes sin y will be smaller than sin x. Instead, calculate sin x = x - x^3 * (1/6 - x^2 / 120 + x^4/5040...) where this cannot happen.
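A minimal sketch of those two evaluation orders, with the series truncated at x^7 (the function names are illustrative only):
/* Naive order: x times a decreasing factor in parentheses. */
double sin_naive(double x)
{
    double x2 = x * x;
    return x * (1.0 - x2 / 6.0 + x2 * x2 / 120.0 - x2 * x2 * x2 / 5040.0);
}

/* Preferred order: x minus a small correction, as described above. */
double sin_better(double x)
{
    double x2 = x * x;
    return x - x2 * x * (1.0 / 6.0 - x2 / 120.0 + x2 * x2 / 5040.0);
}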
When calculating Chebyshev polynomials, you usually need to round the coefficients to double precision, for example. But while a Chebyshev polynomial is optimal, the Chebyshev polynomial with coefficients rounded to double precision is not the optimal polynomial with double precision coefficients!
For example for sin (x), where you need coefficients for x, x^3, x^5, x^7 etc. you do the following: Calculate the best approximation of sin x with a polynomial (ax + bx^3 + cx^5 + dx^7) with higher than double precision, then round a to double precision, giving A. The difference between a and A would be quite large. Now calculate the best approximation of (sin x - Ax) with a polynomial (b x^3 + cx^5 + dx^7). You get different coefficients, because they adapt to the difference between a and A. Round b to double precision B. Then approximate (sin x - Ax - Bx^3) with a polynomial cx^5 + dx^7 and so on. You will get a polynomial that is almost as good as the original Chebyshev polynomial, but much better than Chebyshev rounded to double precision.
Next you should take into account the rounding errors in the choice of polynomial. You found a polynomial with minimum error in the polynomial ignoring rounding error, but you want to optimise polynomial plus rounding error. Once you have the Chebyshev polynomial, you can calculate bounds for the rounding error. Say f (x) is your function, P (x) is the polynomial, and E (x) is the rounding error. You don't want to optimise | f (x) - P (x) |, you want to optimise | f (x) - P (x) +/- E (x) |. You will get a slightly different polynomial that tries to keep the polynomial errors down where the rounding error is large, and relaxes the polynomial errors a bit where the rounding error is small.
All this will easily get you rounding errors of at most 0.55 times the last bit, where +, -, *, / have rounding errors of at most 0.50 times the last bit.
The actual implementation of library functions is up to the specific compiler and/or library provider. Whether it's done in hardware or software, whether it's a Taylor expansion or not, etc., will vary.
I realize that's absolutely no help.
There's nothing like hitting the source and seeing how someone has actually done it in a library in common use; let's look at one C library implementation in particular. I chose uClibc.
Here's the sin function:
http://git.uclibc.org/uClibc/tree/libm/s_sin.c
which looks like it handles a few special cases, and then carries out some argument reduction to map the input to the range [-pi/4,pi/4], (splitting the argument into two parts, a big part and a tail) before calling
http://git.uclibc.org/uClibc/tree/libm/k_sin.c
which then operates on those two parts.
If there is no tail, an approximate answer is generated using a polynomial of degree 13.
If there is a tail, you get a small corrective addition based on the principle that sin(x+y) ≈ sin(x) + sin'(x)·y
They are typically implemented in software and will not use the corresponding hardware (that is, assembly) calls in most cases. However, as Jason pointed out, these are implementation specific.
Note that these software routines are not part of the compiler sources, but will rather be found in the corresponding library, such as the clib or glibc for the GNU compiler. See http://www.gnu.org/software/libc/manual/html_mono/libc.html#Trig-Functions
If you want greater control, you should carefully evaluate what you need exactly. Some of the typical methods are interpolation of look-up tables, the assembly call (which is often slow), or other approximation schemes such as Newton-Raphson for square roots.
If you want an implementation in software, not hardware, the place to look for a definitive answer to this question is Chapter 5 of Numerical Recipes. My copy is in a box, so I can't give details, but the short version (if I remember this right) is that you take tan(theta/2) as your primitive operation and compute the others from there. The computation is done with a series approximation, but it's something that converges much more quickly than a Taylor series.
Sorry I can't remember more without getting my hands on the book.
Whenever such a function is evaluated, then at some level there is most likely either:
A table of values which is interpolated (for fast, inaccurate applications - e.g. computer graphics)
The evaluation of a series that converges to the desired value --- probably not a Taylor series, more likely something based on a fancy quadrature like Clenshaw-Curtis.
If there is no hardware support then the compiler probably uses the latter method, emitting only assembler code (with no debug symbols), rather than using a c library --- making it tricky for you to track the actual code down in your debugger.
If you want to look at the actual GNU implementation of those functions in C, check out the latest trunk of glibc. See the GNU C Library.
As many people pointed out, it is implementation dependent. But as far as I understand your question, you were interested in a real software implementation of math functions, but just didn't manage to find one. If this is the case then here you are:
Download glibc source code from http://ftp.gnu.org/gnu/glibc/
Look at the file dosincos.c located in the unpacked glibc root\sysdeps\ieee754\dbl-64 folder
Similarly you can find implementations of the rest of the math library; just look for the file with the appropriate name
You may also have a look at the files with the .tbl extension; their contents are nothing more than huge tables of precomputed values of different functions in binary form. That is why the implementation is so fast: instead of computing all the coefficients of whatever series they use, they just do a quick lookup, which is much faster. BTW, they do use Taylor series to calculate sine and cosine.
I hope this helps.
I'll try to answer for the case of sin() in a C program, compiled with GCC's C compiler on a current x86 processor (let's say an Intel Core 2 Duo).
In the C language the Standard C Library includes common math functions not included in the language itself (e.g. pow, sin and cos for power, sine, and cosine respectively). Their declarations are in math.h.
Now on a GNU/Linux system, these library functions are provided by glibc (GNU libc or GNU C Library). But the GCC compiler wants you to link to the math library (libm.so) using the -lm compiler flag to enable usage of these math functions. I'm not sure why it isn't part of the standard C library. These would be a software version of the floating point functions, or "soft-float".
Aside: The reason for having the math functions separate is historic, and was merely intended to reduce the size of executable programs in very old Unix systems, possibly before shared libraries were available, as far as I know.
Now the compiler may optimize the standard C library function sin() (provided by libm.so) to be replaced with a call to a native instruction to your CPU/FPU's built-in sin() function, which exists as an FPU instruction (FSIN for x86/x87) on newer processors like the Core 2 series (this is correct pretty much as far back as the i486DX). This would depend on optimization flags passed to the gcc compiler. If the compiler was told to write code that would execute on any i386 or newer processor, it would not make such an optimization. The -mcpu=486 flag would inform the compiler that it was safe to make such an optimization.
Now if the program executed the software version of the sin() function, it would do so based on a CORDIC (COordinate Rotation DIgital Computer) or BKM algorithm, or more likely a table or power-series calculation which is commonly used now to calculate such transcendental functions. [Src: http://en.wikipedia.org/wiki/Cordic#Application]
Any recent (since 2.9x approx.) version of gcc also offers a built-in version of sin, __builtin_sin(), that it will use to replace the standard call to the C library version, as an optimization.
I'm sure that is as clear as mud, but hopefully gives you more information than you were expecting, and lots of jumping off points to learn more yourself.
Don't use Taylor series. Chebyshev polynomials are both faster and more accurate, as pointed out by a couple of people above. Here is an implementation (originally from the ZX Spectrum ROM): https://albertveli.wordpress.com/2015/01/10/zx-sine/
Computing sine/cosine/tangent is actually very easy to do through code using the Taylor series. Writing one yourself takes like 5 seconds.
The whole process can be summed up with the Taylor series for these functions: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ... and cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...
Here are some routines I wrote for C:
double _pow(double a, double b) {
    double c = 1;
    for (int i = 0; i < b; i++)
        c *= a;
    return c;
}

double _fact(double x) {
    double ret = 1;
    for (int i = 1; i <= x; i++)
        ret *= i;
    return ret;
}

double _sin(double x) {
    double y = x;
    double s = -1;
    for (int i = 3; i <= 100; i += 2) {
        y += s * (_pow(x, i) / _fact(i));
        s *= -1;
    }
    return y;
}

double _cos(double x) {
    double y = 1;
    double s = -1;
    for (int i = 2; i <= 100; i += 2) {
        y += s * (_pow(x, i) / _fact(i));
        s *= -1;
    }
    return y;
}

double _tan(double x) {
    return (_sin(x) / _cos(x));
}
Improved version of code from Blindy's answer
#define PI 3.14159265358979323846   /* added so the code below is self-contained */
#define EPSILON .0000000000001
// this is the smallest effective threshold, at least on my OS (WSL Ubuntu 18)
// possibly because the factorial part turns 0 at some point
// and it happens faster than the series element turns 0;
// validation was made against sin() from <math.h>
double ft_sin(double x)
{
    int k = 2;
    double acc = 1;
    double den = 1;
    double num;
    double r;

    // precision drops rapidly when x is not close to 0
    // so move x as close to 0 as possible
    // (reduce by 2*PI so that sin keeps both its value and its sign)
    while (x > PI)
        x -= 2 * PI;
    while (x < -PI)
        x += 2 * PI;
    if (x > PI / 2)
        return (ft_sin(PI - x));
    if (x < -PI / 2)
        return (ft_sin(-PI - x));
    // start the series from the reduced x
    num = x;
    r = x;
    // not using fabs for performance reasons
    while (acc > EPSILON || acc < -EPSILON)
    {
        num *= -x * x;
        den *= k * (k + 1);
        acc = num / den;
        r += acc;
        k += 2;
    }
    return (r);
}
The essence of how it does this lies in this excerpt from Applied Numerical Analysis by Gerald Wheatley:
When your software program asks the computer to get a value of … or …, have you wondered how it can get the values if the most powerful functions it can compute are polynomials? It doesn't look these up in tables and interpolate! Rather, the computer approximates every function other than polynomials from some polynomial that is tailored to give the values very accurately.
A few points to mention on the above: some algorithms do in fact interpolate from a table, albeit only for the first few iterations. Also note how it mentions that computers utilise approximating polynomials without specifying which type of approximating polynomial. As others in the thread have pointed out, Chebyshev polynomials are more efficient than Taylor polynomials in this case.
if you want sin then
__asm__ __volatile__("fsin" : "=t"(vsin) : "0"(xrads));
if you want cos then
__asm__ __volatile__("fcos" : "=t"(vcos) : "0"(xrads));
if you want sqrt then
__asm__ __volatile__("fsqrt" : "=t"(vsqrt) : "0"(value));
so why use inaccurate code when the machine instructions will do?

c: change variable type without casting

I'm changing a uint32_t to a float but without changing the actual bits.
Just to be sure: I don't want to cast it. So float f = (float) i is the exact opposite of what I want to do, because it changes bits.
I'm going to use this to convert my (pseudo) random numbers to float without doing unneeded math.
What I'm currently doing and what is already working is this:
float random_float( uint64_t seed ) {
    // Generate random and change bit format to ieee
    uint32_t asInt = (random_int( seed ) & 0x7FFFFF) | (0x7E000000>>1);
    // Make it a float
    return *(float*)(void*)&asInt; // <-- pretty ugly and needs a variable
}
The Question: Now I'd like to get rid of the asInt variable, and I'd like to know if there is a better / not so ugly way than getting the address of this variable, casting it twice and dereferencing it again?
You could try a union - as long as you make sure the types are identical in memory size:
union convertor {
    int asInt;
    float asFloat;
};
Then you can assign your int to asFloat (or the other way around if you want to). I use it a lot when I need to do bitwise operations on one hand and still get a uint32_t representation of the number on the other hand.
[EDIT]
Like many of the commentators rightfully state, you must take into consideration values that are not representable by integers, like NAN, +INF, -INF, +0, -0.
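As an aside, a memcpy of the bits achieves the same reinterpretation without a union; compilers optimize it to a plain move, and it sidesteps aliasing questions (a sketch, assuming float and uint32_t are both 32 bits; the function name is illustrative):
#include <stdint.h>
#include <string.h>

static float u32_bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);   /* copy the raw bits, no numeric conversion */
    return f;
}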
So you seem to want to generate floating point numbers between 0.5 and 1.0 judging from your code.
Assuming that your microcontroller has a standard C library with floating point support, you can do this all standards compliant without actually involving any floating point operations, all you need is the ldexp function that itself doesn't actually do any floating point math.
This would look something like this:
return ldexpf((1 << 23) + random_thing_smaller_than_23_bits(), -24);
The trick here is that we happen to know that IEEE754 binary32 floating point numbers have integer precision between 2^23 and 2^24 (I could be off-by-one here, double check please, I'm translating this from some work I've done on doubles). So the compiler should know how to convert that number to a float trivially. Then ldexp multiplies that number by 2^-24 by just changing the bits in the exponent. No actual floating point operations involved and no undefined behavior, the code is fully portable to any standard C implementation with IEEE754 numbers. Double check the generated code, but a good compiler and c library should not use any floating point instructions here.
If you want to peek at some experiments I've done around generating random floating point numbers you can peek at this github repo. It's all about doubles, but should be trivially translatable to floats.
Reinterpreting the binary representation of an int to a float would result in major problems:
There are a lot of undefined codes in the binary representation of a float.
Other codes represent special conditions, like NAN, +INF, -INF, +0, -0 (sic!), etc.
Also, if that is a random value, even if you catch all the non-value representations, the result would have a very bad random distribution.
If you are working on an MCU without an FPU, you should think about avoiding float altogether. An alternative might be fractional or scaled integers. There are many implementations of algorithms which use float but can easily be converted to fixed point types with acceptable loss of precision (or even none at all). Some might even yield more precision than float (note that single precision float has only 23 bits of mantissa, while an int32 would have 31 bits, plus 1 sign bit for either; the same holds for a fractional or fixed scaled int).
Note that ISO/IEC TR 18037 ("Embedded C") adds optional support for fixed-point types such as _Fract. You might want to research that.
Edit:
According to your comments, you seem to want to convert the int to a float in the range 0 ≤ f < 1. For that, you can assemble the float using bit operations on a uint32_t (e.g. the original value). You just need to follow the IEEE format (presuming your toolchain complies with IEEE-754; see Wikipedia).
The result (still a uint32_t) can then be reinterpreted through a union or pointer as described by others already. Pack that into a system-dependent, well-commented library and tuck it away. Do not forget to check endianness and alignment (likely both the same for float and uint32_t, but important for the bit-ops).
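A hedged sketch of that bit-assembly idea, assuming IEEE-754 binary32 (the function name is illustrative): an exponent field of 127 with 23 random mantissa bits gives a value in [1.0, 2.0), and subtracting 1.0 maps it to [0.0, 1.0).
#include <stdint.h>
#include <string.h>

static float unit_float_from_bits(uint32_t random_bits)
{
    uint32_t u = 0x3F800000u | (random_bits & 0x007FFFFFu);  /* 1.0 <= value < 2.0 */
    float f;
    memcpy(&f, &u, sizeof f);
    return f - 1.0f;                                          /* 0.0 <= result < 1.0 */
}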

How to avoid floating point round off error in unit tests?

I'm trying to write unit tests for some simple vector math functions that operate on arrays of single precision floating point numbers. The functions use SSE intrinsics and I'm getting false positives (at least I think) when running the tests on a 32-bit system (the tests pass on 64-bit). As the operation runs through the array, I accumulate more and more round off error. Here is a snippet of unit test code and output (my actual question(s) follow):
Test Setup:
static const int N = 1024;
static const float MSCALAR = 42.42f;
static void setup(void) {
    input = _mm_malloc(sizeof(*input) * N, 16);
    ainput = _mm_malloc(sizeof(*ainput) * N, 16);
    output = _mm_malloc(sizeof(*output) * N, 16);
    expected = _mm_malloc(sizeof(*expected) * N, 16);
    memset(output, 0, sizeof(*output) * N);
    for (int i = 0; i < N; i++) {
        input[i] = i * 0.4f;
        ainput[i] = i * 2.1f;
        expected[i] = (input[i] * MSCALAR) + ainput[i];
    }
}
My main test code then calls the function to be tested (which does the same calculation used to generate the expected array) and checks its output against the expected array generated above. The check is for closeness (within 0.0001) not equality.
Sample output:
0.000000 0.000000 delta: 0.000000
44.419998 44.419998 delta: 0.000000
...snip 100 or so lines...
2043.319946 2043.319946 delta: 0.000000
2087.739746 2087.739990 delta: 0.000244
...snip 100 or so lines...
4086.639893 4086.639893 delta: 0.000000
4131.059570 4131.060059 delta: 0.000488
4175.479492 4175.479980 delta: 0.000488
...etc, etc...
I know I have two problems:
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
So my question is, what is the proper way to write meaningful and portable unit tests for math operations on floating point data?
*By portable I mean should pass on both 32 and 64 bit architectures.
Per a comment, we see that the function being tested is essentially:
for (int i = 0; i < N; ++i)
    D[i] = A[i] * b + C[i];
where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].
Below is an analysis of what we could use for an error tolerance when testing this function. First, though, I want to point out that we can design the test so that there is no error. We can choose the values of A[i], b, C[i], and D[i] so that all the results, both final and intermediate results, are exactly representable and there is no rounding error. Obviously, this will not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: Does it execute instructions that compute the desired function? Simply choosing values that would reveal any failures to use the right data, to add, to multiply, or to store to the right location will suffice to reveal bugs in the function. We trust that the hardware performs floating-point correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, A[i] to various small integers, and C[i] to various small integers multiplied by b. I could detail limits on these values more precisely if desired. Then all results would be exact, and any need to allow for a tolerance in comparison would vanish.
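A sketch of such exact test data, following the constraints just described (the particular values and the helper name are illustrative; the arrays are the ones from the question's setup):
static void setup_exact(void)
{
    const float b = 0.5f;                          /* the scalar: a power of two */
    for (int i = 0; i < N; i++) {
        input[i]    = (float)(i % 64);             /* small integers */
        ainput[i]   = (float)((i % 32) - 16) * b;  /* small integers times b */
        expected[i] = input[i] * b + ainput[i];    /* every step exactly representable */
    }
}
With data like this the function under test must match expected[] bit for bit, and the tolerance question disappears.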
That aside, let us proceed to error analysis.
The goal is to find bugs in the implementation of the function. To do this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are seeking almost always cause large errors: The wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always very different from the expected result.
Now the question is how much error should we tolerate? Because bugs will generally cause large errors, we can set the tolerance quite high. However, in floating-point, “high” is still relative; an error of one million is small compared to values in the trillions, but it is too high to discover errors when the input values are in the ones. So we ought to do at least some analysis to decide the level.
The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2^-24, provided all the values are in the normal range of the floating-point format. (2^-24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)
Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also calculate a multiply and an add in IEEE-754 32-bit binary floating-point. If it did, then the result would be identical to the function being tested, and there would be no need to allow for any tolerance in comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.
A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2^-53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2^-24.
Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1-e2-e3-e2•e3)+c•(e1-e2-e3-e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2^-24+2^-53+2^-48)+|c|•(2•2^-24+2^-53+2^-77).
Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2^-24+|c|•2.001•2^-24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.
Note that the expression is not proportional to the final value, a*b+c, as calculated either by the function being tested or by the test program. This means that, in general, tests using a tolerance relative to the final values calculated by the function being tested or by the test program are wrong. The proper form of a test should be something like this:
double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
{
    // Report error here.
}
In summary, this gives us a tolerance that is larger than any possible differences due to floating-point rounding, so it should never give us a false positive (report the test function is broken when it is not). However, it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (fail to report an actual bug).
(Note that there are also rounding errors computing the tolerance, but they are smaller than the slop I have allowed for in using .001 in the coefficients, so we can ignore them.)
(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN, due to a bug, any comparison yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
If you are using GCC as 32-bit compiler, you can tell it to generate SSE2 code still with options -msse2 -mfpmath=sse. Clang can be told to do the same thing with one of the two options and ignores the other one (I forget which). In both cases the binary program should implement strict IEEE 754 semantics, and compute the same result as a 64-bit program that also uses SSE2 instructions to implement strict IEEE 754 semantics.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
The C standard says that a literal such as 42.42f must be converted to either the floating-point number immediately above or immediately below the number represented in decimal. Moreover, if the literal is representable exactly as a floating-point number of the intended format, then this value must be used. However, a quality compiler (such as GCC) will give you(*) the nearest representable floating-point number, of which there is only one, so again, this is not a real portability issue as long as you are using a quality compiler (or at the very least, the same compiler).
Should this turn out to be a problem, a solution is to write an exact representation of the constants you intend. Such an exact representation can be very long in decimal format (up to 750 decimal digits for the exact representation of a double) but is always quite compact in C99's hexadecimal format: 0x1.535c28p+5 for the exact representation of the float nearest to 42.42. A recent version of the static analysis platform for C programs Frama-C can provide the hexadecimal representation of all inexact decimal floating-point constants with option -warn-decimal-float:all.
(*) barring a few conversion bugs in older GCC versions. See Rick Regan's blog for details.
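If you do go the hexadecimal-constant route, note that C99's printf %a conversion prints the exact value for you, which is a convenient way to obtain such constants:
#include <stdio.h>

int main(void)
{
    printf("%a\n", 42.42f);   /* the exact bits of the float nearest 42.42, e.g. 0x1.535c28p+5 */
    return 0;
}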

Fast inverse square of double in C/C++

Recently I was profiling a program in which the hotspot is definitely this
double d = somevalue();
double d2 = d * d;
double c = 1.0 / d2;  // HOT SPOT
The value d2 is not used afterwards because I only need the value c. Some time ago I read about the Carmack method of fast inverse square root; this is obviously not the same case, but I'm wondering whether a similar algorithm can help me compute 1/x^2.
I need quite accurate precision, I've checked that my program doesn't give correct results with gcc -ffast-math option. (g++-4.5)
The tricks for doing fast square roots and the like get their performance by sacrificing precision. (Well, most of them.)
Are you sure you need double precision? You can sacrifice precision easily enough:
double d = somevalue();
float c = 1.0f / ((float) d * (float) d);
The 1.0f is absolutely mandatory in this case; if you use 1.0 instead you will get double precision.
Have you tried enabling "sloppy" math on your compiler? On GCC you can use -ffast-math, there are similar options for other compilers. The sloppy math may be more than good enough for your application. (Edit: I did not see any difference in the resulting assembly.)
If you are using GCC, have you considered using -mrecip? There is a "reciprocal estimate" function which only has about 12 bits of precision, but it is much faster. You can use the Newton-Raphson method to increase the precision of the result. The -mrecip option will cause the compiler to automatically generate the reciprocal estimate and Newton-Raphson steps for you, although you can always write the assembly yourself if you want to fine tune the performance-precision trade-off. (Newton-Raphson converges very quickly.) (Edit: I was unable to get GCC to generate RCPSS. See below.)
I found a blog post (source) discussing the exact problem you are going through, and the author's conclusion is that the techniques like the Carmack method are not competitive with the RCPSS instruction (which the -mrecip flag on GCC uses).
The reason why division can be so slow is because processors generally only have one division unit and it's often not pipelined. So, you can have a few multiplications in the pipe all executing simultaneously, but no division can be issued until the previous division finishes.
Tricks that don't work
Carmack's method: It is obsolete on modern processors, which have reciprocal estimation opcodes. For reciprocals, the best version I've seen only gives one bit of precision -- nothing compared to the 12 bits of RCPSS. I think it is a coincidence that the trick works so well for reciprocal square roots; a coincidence that is unlikely to be repeated.
Relabeling variables. As far as the compiler is concerned, there is very little difference between 1.0/(x*x) and double x2 = x*x; 1.0/x2. I would be surprised if you found a compiler that generates different code for the two versions with optimizations turned on even to the lowest level.
Using pow. The pow library function is a total monster. With GCC's -ffast-math turned off, the library call is fairly expensive. With GCC's -ffast-math turned on, you get the exact same assembly code for pow(x, -2) as you do for 1.0/(x*x), so there is no benefit.
Update
Here is an example of a Newton-Raphson approximation for the inverse square of a double-precision floating-point value.
#define RECIP_ITER 1   /* number of Newton-Raphson refinement steps */

static double invsq(double x)
{
    double y;
    int i;
    __asm__ (
        "cvtpd2ps %1, %0\n\t"
        "rcpss %0, %0\n\t"
        "cvtps2pd %0, %0"
        : "=x"(y)
        : "x"(x));
    for (i = 0; i < RECIP_ITER; ++i)
        y *= 2 - x * y;
    return y * y;
}
Unfortunately, with RECIP_ITER=1 benchmarks on my computer put it slightly slower (~5%) than the simple version 1.0/(x*x). It's faster (2x as fast) with zero iterations, but then you only get 12 bits of precision. I don't know if 12 bits is enough for you.
I think one of the problems here is that this is too small of a micro-optimization; at this scale the compiler writers are on nearly equal footing with the assembly hackers. Maybe if we had the bigger picture we could see a way to make it faster.
For example, you said that -ffast-math caused an undesirable loss of precision; this may indicate a numerical stability problem in the algorithm you are using. With the right choice of algorithm, many problems can be solved with float instead of double. (Of course, you may just need more than 24 bits. I don't know.)
I suspect the RCPSS method shines if you want to compute several of these in parallel.
Yes, you can certainly try and work something out. Let me just give you some general ideas, you can fill in the details.
First, let's see why Carmack's root works:
We write x = M × 2^E in the usual way. Now recall that the IEEE float stores the exponent offset by a bias: if e denotes the exponent field, we have e = Bias + E ≥ 0. Rearranging, we get E = e − Bias.
Now for the inverse square root: x^(−1/2) = M^(−1/2) × 2^(−E/2). The new exponent field is:
e' = Bias − E/2 = 3/2 Bias − e/2
With bit fiddling, we can get the value e/2 from e by shifting, and 3/2 Bias is just a constant.
Moreover, the mantissa M is stored as 1.0 + x with x < 1, and we can approximate M^(−1/2) as 1 − x/2. Again, the fact that only x is stored in binary means that we get the division by two by simple bit shifting.
Now we look at x^(−2): this is equal to M^(−2) × 2^(−2E), and we are looking for an exponent field:
e' = Bias − 2E = 3 Bias − 2e
Again, 3 Bias is just a constant, and you can get 2e from e by bit shifting. As for the mantissa, you can approximate (1 + x)^(−2) by 1 − 2x, and so the problem reduces to obtaining 2x from x.
Note that Carmack's magic floating point fiddling doesn't actually compute the result right away: rather, it produces a remarkably accurate estimate, which is used as the starting point for a traditional, iterative computation. But because the estimate is so good, you only need very few rounds of subsequent iteration to get an acceptable result.
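Putting the above together, a hedged sketch of what such an estimate-plus-iteration routine for 1/x^2 might look like (single precision, positive normal x of moderate magnitude only; 0xBE800000 is 3*Bias shifted into the exponent field with no mantissa tuning, so it is only a rough seed):
#include <stdint.h>
#include <string.h>

static float approx_inv_sq(float x)
{
    uint32_t i;
    float r;

    memcpy(&i, &x, sizeof i);
    i = 0xBE800000u - 2u * i;        /* e' = 3*Bias - 2*e, as derived above */
    memcpy(&r, &i, sizeof r);

    float d = x * x;
    for (int k = 0; k < 3; ++k)
        r = r * (2.0f - d * r);      /* Newton-Raphson refinement of 1/d */
    return r;
}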
For your current program you have identified the hotspot - good. As an alternative to speeding up 1/d^2, you have the option of changing the program so that it does not compute 1/d^2 so often. Can you hoist it out of an inner loop? For how many different values of d do you compute 1/d^2? Could you pre-compute all the values you need and then look up the results? This is a bit cumbersome for 1/d^2, but if 1/d^2 is part of some larger chunk of code, it might be worthwhile applying this trick to that. You say that if you lower the precision, you don't get good enough answers. Is there any way you can rephrase the code, that might provide better behaviour? Numerical analysis is subtle enough that it might be worth trying a few things and seeing what happened.
Ideally, of course, you would find some optimised routine that draws on years of research - is there anything in lapack or linpack that you could link to?
